How "lock add" is implemented on x86 processors - c++

I recently benchmarked std::atomic::fetch_add vs std::atomic::compare_exchange_strong on a 32 core Skylake Intel processor. Unsurprisingly (from the myths I've heard about fetch_add), fetch_add is almost an order of magnitude more scalable than compare_exchange_strong. Looking at the disassembly of the program std::atomic::fetch_add is implemented with a lock add and std::atomic::compare_exchange_strong is implemented with lock cmpxchg (https://godbolt.org/z/qfo4an).
What makes lock add so much faster on an Intel multi-core processor? From my understanding, the slowness in both instructions comes from contention on the cache line, and to execute both instructions with sequential consistency, the executing CPU has to pull the line into its own core in exclusive or modified mode (from MESI). How then does the processor optimize fetch_add internally?
This is a simplified version of the benchmarking code. There was no load+CAS loop in the compare_exchange_strong benchmark, just a single compare_exchange_strong on the atomic per iteration, with an expected value that varied by thread and iteration. So it was just a comparison of instruction throughput under contention from multiple CPUs.
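For reference, here is a minimal sketch of the kind of benchmark described; the function names, thread count, and the exact way the expected value is varied are assumptions, not the original code:

    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <vector>

    std::atomic<uint64_t> counter{0};

    void fetch_add_worker(int iters) {
        for (int i = 0; i < iters; ++i)
            counter.fetch_add(1, std::memory_order_seq_cst);        // compiles to lock add
    }

    void cas_worker(int tid, int iters) {
        for (int i = 0; i < iters; ++i) {
            // expected value varied by thread and iteration, so most attempts fail
            uint64_t expected = static_cast<uint64_t>(tid) + i;
            counter.compare_exchange_strong(expected, expected + 1,
                                            std::memory_order_seq_cst);  // compiles to lock cmpxchg
        }
    }

    int main() {
        constexpr int kThreads = 32, kIters = 1'000'000;
        std::vector<std::thread> pool;
        for (int t = 0; t < kThreads; ++t)
            pool.emplace_back(fetch_add_worker, kIters);  // swap in cas_worker(t, kIters) for the CAS run
        for (auto& th : pool) th.join();
    }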

lock add and lock cmpxchg both work essentially the same way, by holding onto that cache line in Modified state for the duration of the microcoded instruction. (Can num++ be atomic for 'int num'?). According to Agner Fog's instruction tables, lock cmpxchg and lock add are very similar numbers of uops from microcode. (Although lock add is slightly simpler). Agner's throughput numbers are for the uncontended case, where the var stays hot in L1d cache of one core. And cache misses can cause uop replays, but I don't see any reason to expect a significant difference.
You claim you aren't doing load+CAS or using a retry loop. But is it possible you're only counting successful CAS or something? On x86, every CAS (including failures) has almost identical cost to lock add. (With all your threads hammering on the same atomic variable, you'll get lots of CAS failures from using a stale value for expected. This is not the usual use-case for CAS retry loops).
Or does your CAS version actually do a pure load from the atomic variable to get an expected value? That might be leading to memory-order mis-speculation.
You don't have complete code in the question so I have to guess, and couldn't try it on my desktop. You don't even have any perf-counter results or anything like that; there are lots of perf events for off-core memory access, and events like mem_inst_retired.lock_loads that could record number of locked instructions executed.
With lock add, every time a core gets ownership of the cache line, it succeeds at doing an increment. Cores are only waiting for HW arbitration of access to the line, never for another core to get the line and then fail to increment because it had a stale value.
It's plausible that HW arbitration could treat lock add and lock cmpxchg differently, e.g. perhaps letting a core hang onto the line for long enough to do a couple lock add instructions.
Is that what you mean?
Or maybe you have some major failure in microbenchmark methodology, like maybe not doing a warm-up loop to get CPU frequency up from idle before starting your timing? Or maybe some threads happen to finish early and let the other threads run with less contention?
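For what it's worth, a warm-up phase is usually just a short untimed spin before the measured region; a sketch (the duration and helper name are made up):

    #include <chrono>
    #include <cstdint>

    // Spin for ~100 ms before the timed loop so cores ramp up from their idle frequency.
    inline void warm_up(std::chrono::milliseconds dur = std::chrono::milliseconds(100)) {
        volatile uint64_t sink = 0;                       // volatile keeps the busy-work live
        auto end = std::chrono::steady_clock::now() + dur;
        while (std::chrono::steady_clock::now() < end)
            sink = sink + 1;
        (void)sink;
    }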

to execute both instructions with sequential consistency, the executing CPU has to pull the line into its own core in exclusive or modified mode (from MESI).
No, to execute either instruction with any consistent, defined semantics that guarantee that concurrent executions on multiple CPUs do not lose increments, you would need that. Even if you were willing to drop "sequential consistency" (on these instructions) or even drop the usual acquire and release guarantees of reads and writes.
Any locked instruction effectively enforces mutual exclusion on the part of memory sufficient to guarantee atomicity. (Like a regular mutex but at the memory level.) Because no other core can access that memory range for the duration of the operation, the atomicity is trivially guaranteed.
What makes lock add so much faster on an intel multi-core processor?
I would expect any tiny difference in timing to be critical in these cases, and doing the load plus compare (or compare-load plus compare-load ...) might change the timing enough to lose the chance, much as code using mutexes can have widely different efficiency when there is heavy contention and a small change in access pattern changes the way the mutex is attributed.

Related

Semantics of atomic stores in MESI cachelines

Consider a line that is concurrently read and written (plain loads and stores only). What happens when the line is owned by one core in Modified or Exclusive state, and another core issues stores to it (assuming these reads and writes are std::atomic::load and std::atomic::store in C++)? Does the line get pulled into the core that is issuing the writes? Or do the writes find their way over to the reading core directly as needed? The difference between the two is that the latter only costs one round trip for reading the value of the line, and it can possibly be parallelized as well (if the store and the read happen at staggered points in time).
This question arose when considering the consequences of NUMA in a concurrent application. But the question stands when the two cores involved are in the same NUMA node.
There are a large number of architectures in the mix. But for now, curious about what happens on Intel Skylake or Broadwell.
First of all, there's nothing special about atomic loads/stores vs. regular stores by the time they're compiled to asm. (The default seq_cst memory order can compile to xchg, but mov+mfence is also a valid, often slower, option which is indistinguishable in asm from a plain release store followed by a full barrier.) xchg is an atomic RMW + a full barrier. Compilers use it for the full-barrier effect; the load part of the exchange is just an unwanted side-effect.
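To illustrate, here is roughly what the two store flavours look like from C++; the exact instruction choice is up to the compiler, but mainstream x86 compilers pick xchg for seq_cst:

    #include <atomic>

    std::atomic<int> x{0};

    void store_seq_cst(int v) {
        // gcc/clang on x86-64 typically emit: xchg [x], v   (store + full barrier in one RMW)
        x.store(v, std::memory_order_seq_cst);
    }

    void store_release(int v) {
        // plain mov store; x86 stores already have release semantics
        x.store(v, std::memory_order_release);
    }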
The rest of this answer applies fully to any x86 asm store, or the store part of a memory-destination RMW instruction (whether it's atomic or not).
Initially the core that had previously been doing writes will have the line in MESI Modified state in its L1d, assuming it hasn't been evicted to L2 or L3 already.
The line changes MESI state (to shared) in response to a read request, or for stores the core doing the write will send an RFO (request for ownership) and eventually get the line in Modified or Exclusive state.
Getting data between physical cores on modern Intel CPUs always involves write-back to shared L3 (not necessarily to DRAM). I think this is true even on multi-socket systems where the two cores are on separate sockets so they don't really share a common L3, using snooping (and snoop filtering).
Intel uses MESIF. AMD uses MOESI, which allows sharing dirty data directly between cores without write-back to/from a shared outer-level cache first.
For more details, see Which cache mapping technique is used in intel core i7 processor?
There's no way for store-data to reach another core except through cache/memory.
Your idea about the writes "happening" on another core is not how anything works. I don't see how it could even be implemented while respecting x86 memory ordering rules: stores from one core become globally visible in program order. I don't see how you could send stores (to different cache lines) to different cores and make sure one core waited for the other to commit those stores to the cache lines they each owned.
It's also not really plausible even for a weakly-ordered ISA. Often when you read or write a cache line, you're going to do more reads+writes. Sending each read or write request separately over a mesh interconnect between cores would require many many tiny messages. High throughput is much easier to achieve than low latency: wider buses can do that. Low latency for loads is essential for high performance. If threads ever migrate between cores, all of a sudden they'll be read/writing cache lines that are all hot in L1d on some other core, which would be horrible until the CPU somehow decided that it should migrate the cache line to the core accessing it.
L1d caches are small, fast, and relatively simple. The logic for ordering a core's reads+writes relative to each other, and for doing speculative loads, is all internal to a single core. (Store buffer, or on Intel actually a Memory Order Buffer to track speculative loads as well as stores.)
This is why you should avoid even touching a shared variable if you can prove you don't have to. (Or use exponential backoff for cases where that's appropriate.) And why a CAS loop should spin read-only, waiting to see the value it's looking for, before even attempting a CAS, instead of hammering on the cache line with writes from failing lock cmpxchg attempts.
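A sketch of that read-only spinning pattern (the function and variable names are illustrative, not from any particular library):

    #include <atomic>

    // Spin with plain loads until the variable looks like 'want', then attempt one CAS.
    // The read-only spins only keep the line in Shared state; only the CAS itself
    // needs ownership, so contending cores aren't constantly stealing the line with RFOs.
    void wait_and_set(std::atomic<int>& slot, int want, int next) {
        for (;;) {
            int seen = slot.load(std::memory_order_relaxed);
            if (seen != want)
                continue;                        // optionally _mm_pause() / backoff here
            if (slot.compare_exchange_weak(seen, next,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed))
                return;                          // success
            // CAS failed: someone else got there first; go back to read-only spinning
        }
    }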

Why does atomic operation need exclusive cache access?

In my understanding, an atomic operation (a C++ atomic, for example) first locks the cache line and then performs the atomic operation. I have two questions: 1. if, say, atomic compare-and-swap is itself an atomic operation in hardware, why do we need to lock the cache line? And 2. when the cache line is locked, how does another CPU wait for it? Does it use spin-lock style waiting?
Thanks
First of all: It depends!
1.) Whether a system locks a cache line has nothing to do with C++. It is a question of how the cache is organized and, especially, of how the assembly instructions interact with the cache. That is a question of CPU architecture!
2.) How a compiler performs an atomic operation is implementation dependent. Which assembly instructions are generated to perform an atomic operation can vary from compiler to compiler, and even between versions.
3.) As far as I know, a full lock of a cache line is only the fallback solution if no "more clever" notification/synchronization of other cores accessing the same cache line can be performed. And typically there is more than a single cache involved. Think of a multi-level cache architecture: some caches are only visible to a single core! So more memory-system operations are needed than just locking a line. You may also have to move data between different cache levels if multiple cores are involved!
4.) From the C++ perspective, an atomic operation is not only a single operation. What really happens depends on the memory-ordering options for the atomic operation. As atomic operations are often used for inter-thread synchronization, a lot more must be done for a single atomic RMW operation! To get an idea of everything that has to be done, give https://www.cplusplusconcurrencyinaction.com/ a chance. It goes into the details of memory barriers and memory ordering.
5.) Locking a cache line (if this really happens) should not result in spin locks or anything similar on other cores, as the access to the cache line itself takes only a few clock cycles. Depending on the architecture, it simply "holds" the other core for some cycles. It may happen that the "sleeping" core can do other things in parallel in a different pipe. But hey, that is very hardware specific.
As already given as a comment: take a look at https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/, which gives some hints about what can happen with cache coherency and locking.
There is much more than locking under the hood. I believe your question only scratches the surface!
For practical usage: don't think about it! Compiler vendors and CPU architects have done a very good job. You as a programmer should measure your code's performance. From my perspective: there is no need to think about what happens when cache lines are locked. You have to write good algorithms, think about good memory organization of your program data, and minimize interrelationships between threads.

Is x86 CMPXCHG atomic, if so why does it need LOCK?

The Intel documentation says
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.
My question is
Can CMPXCHG operate on a memory address? From the document it seems not, but can anyone confirm that it only works with actual VALUES in registers, not memory addresses?
If CMPXCHG isn't atomic and a high level language level CAS has to be implemented through LOCK CMPXCHG (with LOCK prefix), what's the purpose of introducing such an instruction at all?
(I am asking from a high level language perspective. I.e., if the lock-free algorithm has to be translated into a LOCK CMPXCHG on the x86 platform, then it's still prefixed with LOCK. That means the lock-free algorithms are not better than ones with a carefully written synchronized lock / mutex (on x86 at least). This also seems to make the naked CMPXCHG instruction pointless, as I guess the major point for introducing it, was to support such lock-free operations.)
It seems like part of what you're really asking is:
Why isn't the lock prefix implicit for cmpxchg with a memory operand, like it is for xchg (since 386)?
The simple answer (that others have given) is simply that Intel designed it this way. But this leads to the question:
Why did Intel do that? Is there a use-case for cmpxchg without lock?
On a single-CPU system, cmpxchg is atomic with respect to other threads, or any other code running on the same CPU core. (But not to "system" observers like a memory-mapped I/O device, or a device doing DMA reads of normal memory, so lock cmpxchg was relevant even on uniprocessor CPU designs).
Context switches can only happen on interrupts, and interrupts happen before or after an instruction, not in the middle. Any code running on the same CPU will see the cmpxchg as either fully executed or not at all.
For example, the Linux kernel is normally compiled with SMP support, so it uses lock cmpxchg for atomic CAS. But when booted on a single-processor system, it will patch the lock prefix to a nop everywhere that code was inlined, since nop cmpxchg runs much faster than lock cmpxchg. For more info, see this LWN article about Linux's "SMP alternatives" system. It can even patch back to lock prefixes before hot-plugging a second CPU.
Read more about atomicity of single instructions on uniprocessor systems in this answer, and in @supercat's answer + comments on Can num++ be atomic for 'int num'?. See my answer there for lots of details about how atomicity really works / is implemented for read-modify-write instructions like lock cmpxchg.
(This same reasoning also applies to cmpxchg8b / cmpxchg16b, and xadd, which are usually only used for synchronization / atomic ops, not to make single-threaded code run faster. Of course memory-destination instructions like add [mem], reg are useful outside the lock add [mem], reg case.)
Related:
Interrupting instruction in the middle of execution: only a few instructions like rep movsb and vpgatherdd are interruptible part way through, and they don't support lock. They also have a well-defined way to update architectural state to record their partial progress, unlike a few ISAs where microarchitectural progress can get saved in hidden locations and resumed after an interrupt.
Interrupting an assembly instruction while it is operating: quotes Intel's manuals about that guarantee
When an interrupt occurs, what happens to instructions in the pipeline?
You are mixing up high-level locks with the low-level CPU feature that happened to be named LOCK.
The high-level locks that lock-free algorithms try to avoid can guard arbitrary code fragments whose execution may take arbitrary time and thus, these locks will have to put threads into wait state until the lock is available which is a costly operation, e.g. implies maintaining a queue of waiting threads.
This is an entirely different thing than the CPU LOCK prefix feature which guards a single instruction only and thus might hold other threads for the duration of that single instruction only. Since this is implemented by the CPU itself, it doesn’t require additional software efforts.
Therefore the challenge of developing lock-free algorithms is not to remove synchronization entirely; it boils down to reducing the critical section of the code to a single atomic operation which will be provided by the CPU itself.
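As a concrete illustration of that reduction, compare a counter guarded by a high-level lock with the same counter collapsed to one atomic RMW (which x86 compilers turn into a single lock xadd / lock add); the names here are just for the example:

    #include <atomic>
    #include <mutex>

    std::mutex m;
    long counter_locked = 0;
    std::atomic<long> counter_atomic{0};

    long bump_locked() {
        std::lock_guard<std::mutex> g(m);     // critical section guarded by a high-level lock
        return ++counter_locked;
    }

    long bump_atomic() {
        // critical section shrunk to one atomic instruction provided by the CPU
        return counter_atomic.fetch_add(1, std::memory_order_relaxed) + 1;
    }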
The LOCK prefix locks the memory access for the current instruction, so that other instructions in the CPU pipeline cannot access that memory at the same time. With the LOCK prefix, the execution of the instruction won't be disturbed by other instructions that access the same memory while it executes.
The Intel manual says:
The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. If the LOCK prefix is used with one of these instructions and the source operand is a memory operand, an undefined opcode exception (#UD) may be generated.

ARM atomics performance

I am running the same code on an Intel CPU and an ARM CPU (Mac/iOS, compiler: Clang). By profiling the application, I noticed that on iOS/ARM the atomic operations are the top 3 items, while on Intel they are not even in the top 10. Is it true that on ARM atomic operations are that much slower? (Relatively, of course.)
One point to note is that, thanks to implementation details, you're not necessarily seeing the whole story.
Under the load-linked/store-conditional paradigm of ARM, any atomic operation is at least 4 instructions - load-exclusive, <operation>1, store-exclusive, conditional branch to retry if necessary. Every other core is entirely oblivious to this and carries on doing its own thing.
On x86, however, where instructions can operate directly on memory, atomics are typically accomplished by sticking the LOCK prefix on a single instruction. This means 2 things: firstly, you can never be interrupted inside your atomic 'routine' since it's a single instruction. Secondly, no other core can access memory while the bus is locked, so it effectively pauses execution of everything until it completes2. Together, these mean that a sampling profiler will rarely, if ever, catch the atomic operation 'in progress' regardless of how long it actually takes.
[1] OK, so that makes an atomic swap only 3 instructions, but anything else has one or more instructions in the middle here.
[2] This is slightly less true of modern cores which will only lock their own cache, rather than everything, to avoid affecting other cores accessing unrelated areas, but the hardware cache-coherency will still prevent anyone else interfering.
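From the C++ side, the same source produces those two very different shapes; a sketch (the operation here is just an example):

    #include <atomic>

    std::atomic<int> v{0};

    int fetch_or_bit(int bit) {
        int old = v.load(std::memory_order_relaxed);
        // x86: each retry is one lock cmpxchg instruction.
        // ARM (LL/SC): each retry is load-exclusive, OR, store-exclusive, branch,
        // and the store-exclusive itself can fail even without a value change.
        while (!v.compare_exchange_weak(old, old | (1 << bit),
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
            // 'old' was refreshed by the failed CAS; just retry.
        }
        return old;
    }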

How fast is an atomic/interlocked variable compared to a lock, with or without contention? [duplicate]

This question already has answers here: Overhead of using locks instead of atomic intrinsics
And how much faster/slower is it compared to an uncontested atomic-variable operation (such as on a std::atomic<T> in C++)?
Also, how much slower are contested atomic variables relative to the uncontested lock?
The architecture I'm working on is x86-64.
I happen to have a lot of low-level speed tests lying around. However, what exactly speed means is very uncertain because it depends a lot on what exactly you are doing (even things unrelated to the operation itself).
Here are some numbers from a 64-bit AMD Phenom II X6 at 3.2 GHz. I've also run this on Intel chips and the times do vary a lot (again, depending on exactly what is being done).
A GCC __sync_fetch_and_add, which would be a fully-fenced atomic addition, has an average of 16ns, with a minimum time of 4ns. The minimum time is probably closer to the truth (though even there I have a bit of overhead).
An uncontested pthread mutex (through boost) is 14ns (which is also its minimum). Note this is also a bit too low, since the time will actually increase if something else had locked the mutex but it isn't contested now (since that will cause a cache sync).
A failed try_lock is 9ns.
I don't have a plain old atomic inc since on x86_64 this is just a normal exchange operation. Likely close to the minimum possible time, so 1-2ns.
Calling notify without a waiter on a condition variable is 25ns (if something is waiting about 304ns).
Since all locks impose certain CPU ordering guarantees, however, the amount of memory you have modified (whatever fits in the store buffer) will alter how long such operations take. And obviously if you ever have contention on a mutex, that is your worst time. Any return to the Linux kernel can be hundreds of nanoseconds even if no thread switch actually occurs. This is usually where atomic locks out-perform, since they don't ever involve any kernel calls: your average-case performance is also your worst case. Mutex unlocking also incurs an overhead if there are waiting threads, whereas an atomic would not.
NOTE: Doing such measurements is fraught with problems, so the results are always somewhat questionable. My tests try to minimize variation by fixing the CPU speed, setting CPU affinity for threads, running no other processes, and averaging over large result sets.
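For reference, a stripped-down version of that kind of measurement loop (iteration count and output format are assumptions, not the original harness):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        std::atomic<long> x{0};
        constexpr int kIters = 10'000'000;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kIters; ++i)
            x.fetch_add(1, std::memory_order_seq_cst);   // fully-fenced add, like __sync_fetch_and_add
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / kIters;
        std::printf("uncontended fetch_add: %.1f ns/op\n", ns);
    }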
There's a project on GitHub with the purpose of measuring this on different platforms. Unfortunately, after my master's thesis I never really had the time to follow up on it, but at least the rudimentary code is there.
It measures pthreads and OpenMP locks, compared to the __sync_fetch_and_add intrinsic.
From what I remember, we were expecting a pretty big difference between locks and atomic operations (~ an order of magnitude) but the real difference turned out to be very small.
However, measuring now on my system yields results which reflect my original guess, namely that (regardless of whether pthreads or OpenMP is used) atomic operations are about five times faster, and a single locked increment operation takes about 35ns (this includes acquiring the lock, performing the increment, and releasing the lock).
It depends on the lock implementation, and it depends on the system too. Atomic variables can't really be contested in the same way as a lock (not even if you are using acquire-release semantics); that is the whole point of atomicity. It locks the bus to propagate the store (depending on the memory-barrier mode), but that's an implementation detail.
However, most user-mode locks are just wrapped atomic ops; see this article by Intel for some figures on high-performance, scalable locks using atomic ops under x86 and x64 (compared against Windows' CriticalSection locks; unfortunately no stats are to be found for the SWR locks, but one should always profile for one's own system/environment).
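For a feel of what "wrapped atomic ops" means, here is a minimal spinlock built on a single locked instruction (a sketch; production locks add backoff and a futex/kernel sleep path):

    #include <atomic>

    class SpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // test_and_set compiles to a locked RMW (e.g. xchg) on x86
            while (flag.test_and_set(std::memory_order_acquire)) {
                // spin; real implementations pause and eventually fall back to the OS
            }
        }
        void unlock() { flag.clear(std::memory_order_release); }
    };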