8/16-bit atomics on 32/64-bit processors - c++

In C++11 and C11 it is possible to use 8- and 16-bit atomics. Are there any pitfalls of using them on actual modern 32- and 64-bit CPUs? Are they lock-free? Are they slower than native-size atomics? I'm interested in both what standard says about it and how it's actually implemented on common architectures.

There are no common pitfalls or any reason to expect any.
The standard say nothing about it, but basically nothing about performance guarantees in general. But in practice, if atomic<int> is lock-free, it's almost certain that atomic<int16_t> and atomic<int8_t> are also lock-free. I'd be surprised if there are any mainstream implementations where that's not true.
x86 hardware supports them directly, at the same speed as other operand-sizes. e.g. mov load/store, and for atomic RMWs, lock xadd byte [rdi], al exists in byte operand-size as well as word/dword/qword. Same for all other atomic RMW instructions, including xchg and cmpxchg.
Other ISAs may have minor slowdowns for narrow stores (and maybe also loads), like a cycle of extra latency for a pure-load or pure-store. This is pretty much negligible compared to inter-core latency, and pretty minor even when a cache line is already hot. See Are there any modern CPUs where a cached byte store is actually slower than a word store? (it's not unique to atomic operations.)
Most non-x86 ISAs also have byte and 16-bit versions of the same instructions they provide for atomic RMWs, like ARM ldrexb / strexb.
Of course for an atomic RMW, it's also safe to do an RMW of the containing word, and that can be done "naturally" with minimal extra work for a fetch_or or other bitwise boolean, or a CAS. But I think most widely used ISAs have direct support for byte and 16-bit operations, so don't need that trick.

Related

Are memory orderings: consume, acq_rel and seq_cst ever needed on Intel x86?

C++11 specifies six memory orderings:
typedef enum memory_order {
memory_order_relaxed,
memory_order_consume,
memory_order_acquire,
memory_order_release,
memory_order_acq_rel,
memory_order_seq_cst
} memory_order;
https://en.cppreference.com/w/cpp/atomic/memory_order
where the default is seq_cst.
Performance gains can be found by relaxing the memory ordering of operations. However, this depends on what protections the architecture provides. For example, Intel x86 is a strong memory model and guarantees that various loads/store combinations will not be re-ordered.
As such relaxed, acquire and release seem to be the only orderings required when seeking additional performance on x86.
Is this correct? If not, is there ever a need to use consume, acq_rel and seq_cst on x86?
If you care about portable performance, you should ideally write your C++ source with the minimum necessary ordering for each operation. The only thing that really costs "extra" on x86 is mo_seq_cst for a pure store, so make a point of avoiding that even for x86.
(relaxed ops can also allow more compile-time optimization of the surrounding non-atomic operations, e.g. CSE and dead store elimination, because relaxed ops avoid a compiler barrier. If you don't need any order wrt. surrounding code, tell the compiler that fact so it can optimize.)
Keep in mind that you can't fully test weaker orders if you only have x86 hardware, especially atomic RMWs with only acquire or release, so in practice it's safer to leave your RMWs as seq_cst if you're doing anything that's already complicated and hard to reason about correctness.
x86 asm naturally has acquire loads, release stores, and seq_cst RMW operations. Compile-time reordering is possible with weaker orders in the source, but after the compiler makes its choices, those are "nailed down" into x86 asm. (And stronger store orders require an mfence after mov, or using xchg. seq_cst loads don't actually have any extra cost, but it's more accurate to describe them as acquire because earlier stores can reorder past them, and all being acquire means they can't reorder with each other.)
There are very few use-cases where seq_cst is required (draining the store buffer before later loads can happen). Almost always a weaker order like acquire or release would also be safe.
There are artificial cases like https://preshing.com/20120515/memory-reordering-caught-in-the-act/, but even implementing locking generally only requires acquire and release ordering. (Of course taking a lock does require an atomic RMW, so on x86 that might as well be seq_cst.) One practical use-case I came up with was to have multiple threads set bits in an array. Avoid atomic RMWs and detect when one thread stepped on another by re-checking values that were recently stored. You have to wait until your stores are globally visible before you can safely reload them to check.
As such relaxed, acquire and release seem to be the only orderings required on x86.
From one POV, in C++ source you don't require any ordering weaker than seq_cst (except for performance); that's why it's the default for all std::atomic functions. Remember you're writing C++, not x86 asm.
Or if you mean to describe the full range of what x86 asm can do, then it's acq for loads, rel for pure stores, and seq_cst for atomic RMWs. (The lock prefix is a full barrier; fetch_add(1, relaxed) compiles to the same asm as seq_cst). x86 asm can't do a relaxed load or store1.
The only benefit to using relaxed in C++ (when compiling for x86) is to allow more optimization of surrounding non-atomic operations by reordering at compile time, e.g. to allow optimizations like store coalescing and dead-store elimination. Always remember that you're not writing x86 asm; the C++ memory model applies for compile-time ordering / optimization decisions.
acq_rel and seq_cst are nearly identical for atomic RMW operations in ISO C++,
I think no difference when compiling for ISAs like x86 and ARMv8 that are multi-copy-atomic. (No IRIW reordering like e.g. POWER can do by store-forwarding between SMT threads before a store commits to L1d). How do memory_order_seq_cst and memory_order_acq_rel differ?
For barriers, atomic_thread_fence(mo_acq_rel) compiles to zero instructions on x86, while fence(seq_cst) compiles to mfence or a faster equivalent (e.g. a dummy locked instruction on some stack memory). When is a memory_order_seq_cst fence useful?
You could say acq_rel and consume are truly useless if you're only compiling for x86. consume was intended to expose the dependency ordering that most weakly-ordered ISAs do (notably not DEC Alpha). But unfortunately it was designed in a way that compilers couldn't implement safely so they currently just give up and promote it to acquire, which costs a barrier on some weakly-ordered ISAs. But on x86, acquire is "free" so it's fine.
If you actually do need efficient consume, e.g. for RCU, your only real option is to use relaxed and don't give the compiler enough information to optimize away the data dependency from the asm it makes. C++11: the difference between memory_order_relaxed and memory_order_consume.
Footnote 1: I'm not counting movnt as a relaxed atomic store because the usual C++ -> asm mapping for release operations uses just a mov store, not sfence, and thus would not order an NT store. i.e. std::atomic leaves it up to you to use _mm_sfence() if you'd been messing around with _mm_stream_ps() stores.
PS: this entire answer is assuming normal WB (write-back) cacheable memory regions. If you just use C++ normally under a mainstream OS, all your memory allocations will be WB, not weakly-ordered WC or strongly-ordered uncacheable UC or anything else. In fact even if you wanted a WC mapping of a page, most OSes don't have an API for that. And std::atomic release stores would be broken on WC memory, weakly-ordered like NT stores.

Genuinely test std::atomic is lock-free or not

Since std::atomic::is_lock_free() may not genuinely reflect the reality [ref], I'm considering writing a genuine runtime test instead. However, when I got down to it, I found that it's not a trivial task I thought it to be. I'm wondering whether there is some clever idea that could do it.
Other than performance, the standard doesn't guarantee any way you can tell; that's more or less the point.
If you are willing to introduce some platform-specific UB, you could do something like cast a atomic<int64_t> * to a volatile int64_t* and see if you observe "tearing" when another thread reads the object. (When to use volatile with multi threading? - normally never, but real hardware has coherent caches between cores that run threads so plain asm load/store are basically like relaxed-atomic.)
If this test succeeds (i.e. the plain C++ type was naturally atomic with just volatile), that tells you any sane compiler will make it lock-free very cheaply. But if it fails, it doesn't tell you very much. A lock-free atomic for that type may be only slightly more expensive than the plain version for loads/stores, or the compiler may not make it lock-free at all. e.g. on 32-bit x86 where lock-free int64_t is efficient with only small overhead (using SSE2 or x87), but volatile int64_t* will produce tearing using two separate 4-byte integer loads or stores the way most compilers compile it.
On any specific platform / target architecture, you can single-step your code in a debugger and see what asm instructions run. (Including stepping into libatomic function calls like __atomic_store_16). This is the only 100% reliable way. (Plus consulting ISA documentation to check atomicity guarantees for different instructions, e.g. whether ARM load/store pair is guaranteed, under what conditions.)
(Fun fact: gcc7 with statically linked libatomic may always use locking for 16-byte objects on x86-64, because it doesn't have an opportunity to do runtime CPU detection at dynamic link time and use lock cmpxchg16b on CPUs that support it, with the same mechanism glibc uses to pick optimal memcpy / strchr implementations for the current system.)
You could portably look for a performance difference (e.g. scalability with multiple readers), but x86-64 lock cmpxchg16b doesn't scale1. Multiple readers contend with each other, unlike 8 byte and narrower atomic objects where pure asm loads are atomic and can be used. lock cmpxchg16b acquires exclusive access to a cache line before executing; abusing the side-effect of atomically loading the old value on failure to implement .load() is much worse than an 8-byte atomic load which compiles to just a regular load instruction.
That's part of the reason that gcc7 decided to stop returning true for is_lock_free() on 16-byte objects, as described in the GCC mailing list message about the change you're asking about.
Also note that clang on 32-bit x86 uses lock cmpxchg8b to implement std::atomic<int64_t>, just like for 16-byte objects in 64-bit mode. So you would see a lack of parallel read scaling with it, too. (https://bugs.llvm.org/show_bug.cgi?id=33109)
std::atomic<> implementations that use locking usually still don't make the object larger by including a lock byte or word in each object. It would change the ABI, but lock-free vs. locking is already an ABI difference. The standard allows this, I think, but weird hardware might need extra bytes in the object even when lock-free. Anyway sizeof(atomic<T>) == sizeof(T) doesn't tell you anything either way. If it's larger it's most likely that your implementation added a mutex, but you can't be sure without checking the asm. (If the size wasn't a power of 2, it could have widened it for alignment.)
(In C11, there's much less scope for including a lock in the object: it has to work even with minimal initialization (e.g. statically to 0), and no destructor. Compilers / ABIs generally want their C stdatomic.h atomics to be compatible with their C++ std::atomic atomics.)
The normal mechanism is to use the address of the atomic object as a key for a global hash table of locks. Two objects aliasing / colliding and sharing the same lock is extra contention, but not a correctness problem. These locks are only taken/released from library functions, not while holding other such locks, so it can't create a deadlock.
You could detect this by using shared memory between two different processes (so each process would have its own hash table of locks).
Is C++11 atomic<T> usable with mmap?
check that std::atomic<T> is the same size as T (so the lock isn't in the object itself).
Map a shared memory segment from two separate processes that don't otherwise share any of their address space. It doesn't matter if you map it to a different base address in each process.
Store patterns like all-ones and all-zeros from one process while reading from the other (and look for tearing). Same as what I suggested with volatile above.
Also test atomic increment: have each thread do 1G increments and check that the result is 2G every time. Even if pure load and pure store are naturally atomic (the tearing test), read-modify-write operations like fetch_add / operator++ need special support: Can num++ be atomic for 'int num'?
From the C++11 standard, the intent is that this should still be atomic for lock-free objects. It might also work for non-lock-free objects (if they embed the lock in the object), which is why you have to rule that out by checking sizeof().
To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state.
If you see tearing between two processes, the object wasn't lock-free (at least not the way C++11 intended, and not the way you'd expect on normal shared-memory CPUs.)
I'm not sure why address-free matters if the processes don't have to share any address-space other than 1 page containing the atomic object2. (Of course, C++11 doesn't require that the implementation uses pages at all. Or maybe an implementation could put the hash table of locks at the top or bottom of each page? In which case using a hash function that depended on address bits above the page offset would be totally silly.)
Anyway, this depends on a lot of assumptions about how computers work that are true on all normal CPUs, but which C++ doesn't make. If the implementation you care about is on a mainstream CPU like x86 or ARM under a normal OS, then this testing method should be fairly accurate and might be an alternative to just reading the asm. It's not something that's very practical to do automatically at compile time, but it would be possible to automate a test like this and put it into a build script, unlike reading the asm.
Footnote 1: 16-byte atomics on x86
(Update: Intel recently documented that the AVX feature bit implies 16-byte atomicity for aligned loads/stores, such as with movaps. At least on Intel CPUs specifically; AMD CPUs with AVX in practice seem to be like that too, but AMD hasn't yet documented it officially. The rest of this answer was written before that, but GCC's libatomic does use vmovdqa [mem], xmm / mfence for atomic 16-byte stores on CPUs where that's guaranteed atomic.)
No x86 hardware documents support for 16-byte atomic load/store with SSE instructions. In practice many modern CPUs do have atomic movaps load/store, but there are no guarantees of this in Intel/AMD manuals the way there are for 8-byte x87/MMX/SSE loads/stores on Pentium and later. And no way to detect which CPUs do/don't have atomic 128-bit ops (other than lock cmpxchg16b), so compiler writers can't safely use them.
See SSE instructions: which CPUs can do atomic 16B memory operations? for a nasty corner case: testing on K10 shows that aligned xmm load/store shows no tearing between threads on the same socket, but threads on different sockets experience rare tearing because HyperTransport apparently only gives the minimum x86 atomicity guarantee of 8 byte objects. (IDK if lock cmpxchg16b is more expensive on a system like that.)
Without published guarantees from vendors, we can never be sure about weird microarchitectural corner cases, either. Lack of tearing in a simple test with one thread writing patterns and the other reading is pretty good evidence, but it's always possible that something could be different in some special case the CPU designers decided to handle a different way than normal.
A pointer + counter struct where read-only access only needs the pointer can be cheap, but current compilers need union hacks to get them to do an 8-byte atomic load of just the first half of the object. How can I implement ABA counter with c++11 CAS?. For an ABA counter, you'd normally update it with a CAS anyway, so lack of a 16-byte atomic pure store is not a problem.
An ILP32 ABI (32-bit pointers) in 64-bit mode (like Linux's x32 ABI, or AArch64's ILP32 ABI) means pointer+integer can fit in only 8 bytes, but integer registers are still 8 bytes wide. This makes it much more efficient to use a pointer+counter atomic object than in full 64-bit mode where a pointer is 8 bytes.
Footnote 2: address-free
I think the term "address-free" is a separate claim from not depending on any per-process state. As I understand it, it means that correctness doesn't depend on both threads using the same address for the same memory location. But if correctness also depends on them sharing the same global hash table (IDK why storing the address of an object in the object itself would ever help), that would only matter if it was possible to have multiple addresses for the same object within the same process. That is possible on something like x86's real-mode segmentation model, where a 20-bit linear address space is addressed with 32-bit segment:offset. (Actual C implementations for 16-bit x86 exposed segmentation to the programmer; hiding it behind C's rules would be possible but not high performance.)
It's also possible with virtual memory: two mappings of the same physical page to different virtual addresses within the same process is possible but weird. That might or might not use the same lock, depending on whether the hash function uses any address bits above the page offset.
(The low bits of an address, that represent the offset within a page, are the same for every mapping. i.e. virtual to physical translation for those bits is a no-op, which is why VIPT caches are usually designed to take advantage of that to get speed without aliasing.)
So a non-lock-free object might be address-free within a single process, even if it uses a separate global hash table instead of adding a mutex to the atomic object. But this would be a very unusual situation; it's extremely rare to use virtual memory tricks to create two addresses for the same variable within the same process that shares all of its address-space between threads. Much more common would be atomic objects in shared memory between processes. (I may be misunderstanding the meaning of "address-free"; possibly it means "address-space free", i.e. lack of dependency on other addresses being shared.)
I think you are really just trying to detect this special case specific to gcc where is_lock_free reports false, but the underlying implementation (hidden behind a libatomic function call) is still using cmpxchg16b. You want to know about this, since you consider such an implementation genuinely lock free.
In that case, as a practical matter, I would just write your detection function to hardcode the gcc version range you know operates in this manner. Currently, all versions after the one in which the change to stop inlining cmpxchg16b apparently still use a lock-free implementation under the covers, so a check today would be "open ended" (i.e., all versions after X). Prior to this point is_lock_free returns true (which you consider correct). After some hypothetical future change to gcc which makes the library call use locks, the is_lock_free() == false answer will become genuinely true, and you'll close your check by recording the version in which it occurred.
So something like this should be a good start:
template <typename T>
bool is_genuinely_lock_free(std::atomic<T>& t) {
#if __GNUC__ >= LF16_MAJOR_FIRST && __GNUC_MINOR__ >= LF16_MINOR_FIRST && \
__GNUC__ <= LF16_MAJOR_LAST && __GNUC_MINOR__ <= LF16_MINOR_LAST
return sizeof(T) == 16 || t.is_lock_free();
#else
return t.is_lock_free();
#endif
}
Here the LF16 macros define the version range where gcc returns the "wrong" answer for is_lock_free for 16-byte objects. Note that since second half of this change (to make __atomic_load_16 and friends use locks) you'll only need the first half of the check today. You need to determine the exact version when is_lock_free() started returning false for 16-byte objects: the links Peter provides discussing this issue are a good start, and you can do some checking in godbolt - although the latter doesn't provide everything you need since it doesn't decompile library functions like __atomic_load16: you may need to dig into the libatomic source for that. It's also possible that the macro check should be tied to the libstdc++ or libatomic version instead of the compiler version (although AFAIK in typical installs the versions of all of those are bound together). You'll probably want to add a few more checks to the #if to limit it to 64-bit x86 platforms as well.
I think this approach is valid since the concept of genuinely lock-free isn't really well-defined: you have decided in this case you want to consider the cmpxchg16b implementation in gcc lock-free, but if other grey areas occur in other future implementations you'll want to make another judgment call about whether you consider it lock-free. So the hardcoding approach seems approximately as robust for the non-gcc cases as some type of detection since in either case unknown future implementations may trigger the wrong answer. For the gcc case it seems more robust and definitely more simple.
The basis for this idea is that getting the answer wrong is not going to be a world-destroying functional problem, but rather a performance issue: I'm guessing you are trying to do this detection to select between alternate implementations one of which is faster on a "genuinely" lock-free system, and other being more suitable when std::atomic is lock-based.
If your requirements are stronger, and you really want to be more robust, why not combine approaches: use this simple version detection approach and combine it with a runtime/compile-time detection approach which examines tearing behavior or decompilation as suggested in Peter's answer. If both approaches agree, use it as your answer; if they disagree, however, surface the error and do further investigation. This will also help you catch the point, if ever, at which gcc changes the implementation to make 16-byte objects lock-full.

Why does a std::atomic store with sequential consistency use XCHG?

Why is std::atomic's store:
std::atomic<int> my_atomic;
my_atomic.store(1, std::memory_order_seq_cst);
doing an xchg when a store with sequential consistency is requested?
Shouldn't, technically, a normal store with a read/write memory barrier be enough? Equivalent to:
_ReadWriteBarrier(); // Or `asm volatile("" ::: "memory");` for gcc/clang
my_atomic.store(1, std::memory_order_acquire);
I'm explicitly talking about x86 & x86_64. Where a store has an implicit acquire fence.
mov-store + mfence and xchg are both valid ways to implement a sequential-consistency store on x86. The implicit lock prefix on an xchg with memory makes it a full memory barrier, like all atomic RMW operations on x86.
(x86's memory-ordering rules essentially make that full-barrier effect the only option for any atomic RMW: it's both a load and a store at the same time, stuck together in the global order. Atomicity requires that the load and store aren't separated by just queuing the store into the store buffer so it has to be drained, and load-load ordering of the load side requires that it not reorder.)
Plain mov is not sufficient; it only has release semantics, not sequential-release. (Unlike AArch64's stlr instruction, which does do a sequential-release store that can't reorder with later ldar sequential-acquire loads. This choice is obviously motivated by C++11 having seq_cst as the default memory ordering. But AArch64's normal store is much weaker; relaxed not release.)
See Jeff Preshing's article on acquire / release semantics, and note that regular release stores (like mov or any non-locked x86 memory-destination instruction other than xchg) allows reordering with later operations, including acquire loads (like mov or any x86 memory-source operand). e.g. If the release-store is releasing a lock, it's ok for later stuff to appear to happen inside the critical section.
There are performance differences between mfence and xchg on different CPUs, and maybe in the hot vs. cold cache and contended vs. uncontended cases. And/or for throughput of many operations back to back in the same thread vs. for one on its own, and for allowing surrounding code to overlap execution with the atomic operation.
See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for actual benchmarks of mfence vs. lock addl $0, -8(%rsp) vs. (%rsp) as a full barrier (when you don't already have a store to do).
On Intel Skylake hardware, mfence blocks out-of-order execution of independent ALU instructions, but xchg doesn't. (See my test asm + results in the bottom of this SO answer). Intel's manuals don't require it to be that strong; only lfence is documented to do that. But as an implementation detail, it's very expensive for out-of-order execution of surrounding code on Skylake.
I haven't tested other CPUs, and this may be a result of a microcode fix for erratum SKL079, SKL079 MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. The existence of the erratum basically proves that SKL used to be able to execute instructions after MFENCE. I wouldn't be surprised if they fixed it by making MFENCE stronger in microcode, kind of a blunt instrument approach that significantly increases the impact on surrounding code.
I've only tested the single-threaded case where the cache line is hot in L1d cache. (Not when it's cold in memory, or when it's in Modified state on another core.) xchg has to load the previous value, creating a "false" dependency on the old value that was in memory. But mfence forces the CPU to wait until previous stores commit to L1d, which also requires the cache line to arrive (and be in M state). So they're probably about equal in that respect, but Intel's mfence forces everything to wait, not just loads.
AMD's optimization manual recommends xchg for atomic seq-cst stores. I thought Intel recommended mov + mfence, which older gcc uses, but Intel's compiler also uses xchg here.
When I tested, I got better throughput on Skylake for xchg than for mov+mfence in a single-threaded loop on the same location repeatedly. See Agner Fog's microarch guide and instruction tables for some details, but he doesn't spend much time on locked operations.
See gcc/clang/ICC/MSVC output on the Godbolt compiler explorer for a C++11 seq-cst my_atomic = 4; gcc uses mov + mfence when SSE2 is available. (use -m32 -mno-sse2 to get gcc to use xchg too). The other 3 compilers all prefer xchg with default tuning, or for znver1 (Ryzen) or skylake.
The Linux kernel uses xchg for __smp_store_mb().
Update: recent GCC (like GCC10) changed to using xchg for seq-cst stores like other compilers do, even when SSE2 for mfence is available.
Another interesting question is how to compile atomic_thread_fence(mo_seq_cst);. The obvious option is mfence, but lock or dword [rsp], 0 is another valid option (and used by gcc -m32 when MFENCE isn't available). The bottom of the stack is usually already hot in cache in M state. The downside is introducing latency if a local was stored there. (If it's just a return address, return-address prediction is usually very good so delaying ret's ability to read it is not much of a problem.) So lock or dword [rsp-4], 0 could be worth considering in some cases. (gcc did consider it, but reverted it because it makes valgrind unhappy. This was before it was known that it might be better than mfence even when mfence was available.)
All compilers currently use mfence for a stand-alone barrier when it's available. Those are rare in C++11 code, but more research is needed on what's actually most efficient for real multi-threaded code that has real work going on inside the threads that are communicating locklessly.
But multiple source recommend using lock add to the stack as a barrier instead of mfence, so the Linux kernel recently switched to using it for the smp_mb() implementation on x86, even when SSE2 is available.
See https://groups.google.com/d/msg/fa.linux.kernel/hNOoIZc6I9E/pVO3hB5ABAAJ for some discussion, including a mention of some errata for HSW/BDW about movntdqa loads from WC memory passing earlier locked instructions. (Opposite of Skylake, where it was mfence instead of locked instructions that were a problem. But unlike SKL, there's no fix in microcode. This may be why Linux still uses mfence for its mb() for drivers, in case anything ever uses NT loads to copy back from video RAM or something but can't let the reads happen until after an earlier store is visible.)
In Linux 4.14, smp_mb() uses mb(). That uses mfence is used if available, otherwise lock addl $0, 0(%esp).
__smp_store_mb (store + memory barrier) uses xchg (and that doesn't change in later kernels).
In Linux 4.15, smb_mb() uses lock; addl $0,-4(%esp) or %rsp, instead of using mb(). (The kernel doesn't use a red-zone even in 64-bit, so the -4 may help avoid extra latency for local vars).
mb() is used by drivers to order access to MMIO regions, but smp_mb() turns into a no-op when compiled for a uniprocessor system. Changing mb() is riskier because it's harder to test (affects drivers), and CPUs have errata related to lock vs. mfence. But anyway, mb() uses mfence if available, else lock addl $0, -4(%esp). The only change is the -4.
In Linux 4.16, no change except removing the #if defined(CONFIG_X86_PPRO_FENCE) which defined stuff for a more weakly-ordered memory model than the x86-TSO model that modern hardware implements.
x86 & x86_64. Where a store has an implicit acquire fence
You mean release, I hope. my_atomic.store(1, std::memory_order_acquire); won't compile, because write-only atomic operations can't be acquire operations. See also Jeff Preshing's article on acquire/release semantics.
Or asm volatile("" ::: "memory");
No, that's a compiler barrier only; it prevents all compile-time reordering across it, but doesn't prevent runtime StoreLoad reordering, i.e. the store being buffered until later, and not appearing in the global order until after a later load. (StoreLoad is the only kind of runtime reordering x86 allows.)
Anyway, another way to express what you want here is:
my_atomic.store(1, std::memory_order_release); // mov
// with no operations in between, there's nothing for the release-store to be delayed past
std::atomic_thread_fence(std::memory_order_seq_cst); // mfence
Using a release fence would not be strong enough (it and the release-store could both be delayed past a later load, which is the same thing as saying that release fences don't keep later loads from happening early). A release-acquire fence would do the trick, though, keeping later loads from happening early and not itself being able to reorder with the release store.
Related: Jeff Preshing's article on fences being different from release operations.
But note that seq-cst is special according to C++11 rules: only seq-cst operations are guaranteed to have a single global / total order which all threads agree on seeing. So emulating them with weaker order + fences might not be exactly equivalent in general on the C++ abstract machine, even if it is on x86. (On x86, all store have a single total order which all cores agree on. See also Globally Invisible load instructions: Loads can take their data from the store buffer, so we can't really say that there's a total order for loads + stores.)

Why is integer assignment on a naturally aligned variable atomic on x86?

I've been reading this article about atomic operations, and it mentions 32-bit integer assignment being atomic on x86, as long as the variable is naturally aligned.
Why does natural alignment assure atomicity?
"Natural" alignment means aligned to its own type width. Thus, the load/store will never be split across any kind of boundary wider than itself (e.g. page, cache-line, or an even narrower chunk size used for data transfers between different caches).
CPUs often do things like cache-access, or cache-line transfers between cores, in power-of-2 sized chunks, so alignment boundaries smaller than a cache line do matter. (See #BeeOnRope's comments below). See also Atomicity on x86 for more details on how CPUs implement atomic loads or stores internally, and Can num++ be atomic for 'int num'? for more about how atomic RMW operations like atomic<int>::fetch_add() / lock xadd are implemented internally.
First, this assumes that the int is updated with a single store instruction, rather than writing different bytes separately. This is part of what std::atomic guarantees, but that plain C or C++ doesn't. It will normally be the case, though. The x86-64 System V ABI doesn't forbid compilers from making accesses to int variables non-atomic, even though it does require int to be 4B with a default alignment of 4B. For example, x = a<<16 | b could compile to two separate 16-bit stores if the compiler wanted.
Data races are Undefined Behaviour in both C and C++, so compilers can and do assume that memory is not asynchronously modified. For code that is guaranteed not to break, use C11 stdatomic or C++11 std::atomic. Otherwise the compiler will just keep a value in a register instead of reloading every time your read it, like volatile but with actual guarantees and official support from the language standard.
Before C++11, atomic ops were usually done with volatile or other things, and a healthy dose of "works on compilers we care about", so C++11 was a huge step forward. Now you no longer have to care about what a compiler does for plain int; just use atomic<int>. If you find old guides talking about atomicity of int, they probably predate C++11. When to use volatile with multi threading? explains why that works in practice, and that atomic<T> with memory_order_relaxed is the modern way to get the same functionality.
std::atomic<int> shared; // shared variable (compiler ensures alignment)
int x; // local variable (compiler can keep it in a register)
x = shared.load(std::memory_order_relaxed);
shared.store(x, std::memory_order_relaxed);
// shared = x; // don't do that unless you actually need seq_cst, because MFENCE or XCHG is much slower than a simple store
Side-note: for atomic<T> larger than the CPU can do atomically (so .is_lock_free() is false), see Where is the lock for a std::atomic?. int and int64_t / uint64_t are lock-free on all the major x86 compilers, though.
Thus, we just need to talk about the behaviour of an instruction like mov [shared], eax.
TL;DR: The x86 ISA guarantees that naturally-aligned stores and loads are atomic, up to 64bits wide. So compilers can use ordinary stores/loads as long as they ensure that std::atomic<T> has natural alignment.
(But note that i386 gcc -m32 fails to do that for C11 _Atomic 64-bit types inside structs, only aligning them to 4B, so atomic_llong can be non-atomic in some cases. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4). g++ -m32 with std::atomic is fine, at least in g++5 because https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147 was fixed in 2015 by a change to the <atomic> header. That didn't change the C11 behaviour, though.)
IIRC, there were SMP 386 systems, but the current memory semantics weren't established until 486. This is why the manual says "486 and newer".
From the "Intel® 64 and IA-32 Architectures Software Developer Manuals, volume 3", with my notes in italics. (see also the x86 tag wiki for links: current versions of all volumes, or direct link to page 256 of the vol3 pdf from Dec 2015)
In x86 terminology, a "word" is two 8-bit bytes. 32 bits are a double-word, or DWORD.
###Section 8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory
operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary (This is another way of saying "natural alignment")
That last point that I bolded is the answer to your question: This behaviour is part of what's required for a processor to be an x86 CPU (i.e. an implementation of the ISA).
The rest of the section provides further guarantees for newer Intel CPUs: Pentium widens this guarantee to 64 bits.
The
Pentium processor (and newer processors since) guarantees that the
following additional memory operations will always be carried out
atomically:
Reading or writing a quadword aligned on a 64-bit boundary
(e.g. x87 load/store of a double, or cmpxchg8b (which was new in Pentium P5))
16-bit accesses to uncached memory locations that fit within a 32-bit data bus.
The section goes on to point out that accesses split across cache lines (and page boundaries) are not guaranteed to be atomic, and:
"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using
multiple memory accesses."
AMD's manual agrees with Intel's about aligned 64-bit and narrower loads/stores being atomic
So integer, x87, and MMX/SSE loads/stores up to 64b, even in 32-bit or 16-bit mode (e.g. movq, movsd, movhps, pinsrq, extractps, etc.) are atomic if the data is aligned. gcc -m32 uses movq xmm, [mem] to implement atomic 64-bit loads for things like std::atomic<int64_t>. Clang4.0 -m32 unfortunately uses lock cmpxchg8b bug 33109.
On some CPUs with 128b or 256b internal data paths (between execution units and L1, and between different caches), 128b and even 256b vector loads/stores are atomic, but this is not guaranteed by any standard or easily queryable at run-time, unfortunately for compilers implementing std::atomic<__int128> or 16B structs.
(Update: x86 vendors have decided that the AVX feature bit also indicates atomic 128-bit aligned loads/stores. Before that we only had https://rigtorp.se/isatomic/ experimental testing to verify it.)
If you want atomic 128b across all x86 systems, you must use lock cmpxchg16b (available only in 64bit mode). (And it wasn't available in the first-gen x86-64 CPUs. You need to use -mcx16 with GCC/Clang for them to emit it.)
Even CPUs that internally do atomic 128b loads/stores can exhibit non-atomic behaviour in multi-socket systems with a coherency protocol that operates in smaller chunks: e.g. AMD Opteron 2435 (K10) with threads running on separate sockets, connected with HyperTransport.
Intel's and AMD's manuals diverge for unaligned access to cacheable memory. The common subset for all x86 CPUs is the AMD rule. Cacheable means write-back or write-through memory regions, not uncacheable or write-combining, as set with PAT or MTRR regions. They don't mean that the cache-line has to already be hot in L1 cache.
Intel P6 and later guarantee atomicity for cacheable loads/stores up to 64 bits as long as they're within a single cache-line (64B, or 32B on very old CPUs like Pentium III).
AMD guarantees atomicity for cacheable loads/stores that fit within a single 8B-aligned chunk. That makes sense, because we know from the 16B-store test on multi-socket Opteron that HyperTransport only transfers in 8B chunks, and doesn't lock while transferring to prevent tearing. (See above). I guess lock cmpxchg16b must be handled specially.
Possibly related: AMD uses MOESI to share dirty cache-lines directly between caches in different cores, so one core can be reading from its valid copy of a cache line while updates to it are coming in from another cache.
Intel uses MESIF, which requires dirty data to propagate out to the large shared inclusive L3 cache which acts as a backstop for coherency traffic. L3 is tag-inclusive of per-core L2/L1 caches, even for lines that have to be in the Invalid state in L3 because of being M or E in a per-core L1 cache. The data path between L3 and per-core caches is only 32B wide in Haswell/Skylake, so it must buffer or something to avoid a write to L3 from one core happening between reads of two halves of a cache line, which could cause tearing at the 32B boundary.
The relevant sections of the manuals:
The P6 family processors (and newer Intel processors
since) guarantee that the following additional memory operation will
always be carried out atomically:
Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.
AMD64 Manual 7.3.2 Access Atomicity
Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor
model, as are misaligned loads or stores of less than a quadword that
are contained entirely within a naturally-aligned quadword
Notice that AMD guarantees atomicity for any load smaller than a qword, but Intel only for power-of-2 sizes. 32-bit protected mode and 64-bit long mode can load a 48 bit m16:32 as a memory operand into cs:eip with far-call or far-jmp. (And far-call pushes stuff on the stack.) IDK if this counts as a single 48-bit access or separate 16 and 32-bit.
There have been attempts to formalize the x86 memory model, the latest one being the x86-TSO (extended version) paper from 2009 (link from the memory-ordering section of the x86 tag wiki). It's not usefully skimmable since they define some symbols to express things in their own notation, and I haven't tried to really read it. IDK if it describes the atomicity rules, or if it's only concerned with memory ordering.
Atomic Read-Modify-Write
I mentioned cmpxchg8b, but I was only talking about the load and the store each separately being atomic (i.e. no "tearing" where one half of the load is from one store, the other half of the load is from a different store).
To prevent the contents of that memory location from being modified between the load and the store, you need lock cmpxchg8b, just like you need lock inc [mem] for the entire read-modify-write to be atomic. Also note that even if cmpxchg8b without lock does a single atomic load (and optionally a store), it's not safe in general to use it as a 64b load with expected=desired. If the value in memory happens to match your expected, you'll get a non-atomic read-modify-write of that location.
The lock prefix makes even unaligned accesses that cross cache-line or page boundaries atomic, but you can't use it with mov to make an unaligned store or load atomic. It's only usable with memory-destination read-modify-write instructions like add [mem], eax.
(lock is implicit in xchg reg, [mem], so don't use xchg with mem to save code-size or instruction count unless performance is irrelevant. Only use it when you want the memory barrier and/or the atomic exchange, or when code-size is the only thing that matters, e.g. in a boot sector.)
See also: Can num++ be atomic for 'int num'?
Why lock mov [mem], reg doesn't exist for atomic unaligned stores
From the instruction reference manual (Intel x86 manual vol2), cmpxchg:
This instruction can be used with a LOCK prefix to allow the
instruction to be executed atomically. To simplify the interface to
the processor’s bus, the destination operand receives a write cycle
without regard to the result of the comparison. The destination
operand is written back if the comparison fails; otherwise, the source
operand is written into the destination. (The processor never produces
a locked read without also producing a locked write.)
This design decision reduced chipset complexity before the memory controller was built into the CPU. It may still do so for locked instructions on MMIO regions that hit the PCI-express bus rather than DRAM. It would just be confusing for a lock mov reg, [MMIO_PORT] to produce a write as well as a read to the memory-mapped I/O register.
The other explanation is that it's not very hard to make sure your data has natural alignment, and lock store would perform horribly compared to just making sure your data is aligned. It would be silly to spend transistors on something that would be so slow it wouldn't be worth using. If you really need it (and don't mind reading the memory too), you could use xchg [mem], reg (XCHG has an implicit LOCK prefix), which is even slower than a hypothetical lock mov.
Using a lock prefix is also a full memory barrier, so it imposes a performance overhead beyond just the atomic RMW. i.e. x86 can't do relaxed atomic RMW (without flushing the store buffer). Other ISAs can, so using .fetch_add(1, memory_order_relaxed) can be faster on non-x86.
Fun fact: Before mfence existed, a common idiom was lock add dword [esp], 0, which is a no-op other than clobbering flags and doing a locked operation. [esp] is almost always hot in L1 cache and won't cause contention with any other core. This idiom may still be more efficient than MFENCE as a stand-alone memory barrier, especially on AMD CPUs.
xchg [mem], reg is probably the most efficient way to implement a sequential-consistency store, vs. mov+mfence, on both Intel and AMD. mfence on Skylake at least blocks out-of-order execution of non-memory instructions, but xchg and other locked ops don't. Compilers other than gcc do use xchg for stores, even when they don't care about reading the old value.
Motivation for this design decision:
Without it, software would have to use 1-byte locks (or some kind of available atomic type) to guard accesses to 32bit integers, which is hugely inefficient compared to shared atomic read access for something like a global timestamp variable updated by a timer interrupt. It's probably basically free in silicon to guarantee for aligned accesses of bus-width or smaller.
For locking to be possible at all, some kind of atomic access is required. (Actually, I guess the hardware could provide some kind of totally different hardware-assisted locking mechanism.) For a CPU that does 32bit transfers on its external data bus, it just makes sense to have that be the unit of atomicity.
Since you offered a bounty, I assume you were looking for a long answer that wandered into all interesting side topics. Let me know if there are things I didn't cover that you think would make this Q&A more valuable for future readers.
Since you linked one in the question, I highly recommend reading more of Jeff Preshing's blog posts. They're excellent, and helped me put together the pieces of what I knew into an understanding of memory ordering in C/C++ source vs. asm for different hardware architectures, and how / when to tell the compiler what you want if you aren't writing asm directly.
If a 32-bit or smaller object is naturally-aligned within a "normal" part of memory, it will be possible for any 80386 or compatible processor other than the
80386sx to read or write all 32 bits of the object in a single operation. While the ability of a platform to do something in a quick and useful fashion doesn't necessarily mean the platform won't sometimes do it in some other fashion for some reason, and while I believe it's possible on many if not all x86 processors to have regions of memory which can only be accessed 8 or 16 bits at a time, I don't think Intel has ever defined any conditions where requesting an aligned 32-bit access to a "normal" area of memory would cause the system to read or write part of the value without reading or writing the whole thing, and I don't think Intel has any intention of ever defining any such thing for "normal" areas of memory.
Naturally aligned means that the address of the type is a multiple of the size of the type.
For example, a byte can be at any address, a short (assuming 16 bits) must be on a multiple of 2, an int (assuming 32 bits) must be on a multiple of 4, and a long (assuming 64 bits) must be on a multiple of 8.
In the event that you access a piece of data that is not naturally aligned the CPU will either raise a fault or will read/write the memory, but not as an atomic operation. The action the CPU takes will depend on the architecture.
For example, image we've got the memory layout below:
01234567
...XXXX.
and
int *data = (int*)3;
When we try to read *data the bytes that make up the value are spread across 2 int size blocks, 1 byte is in block 0-3 and 3 bytes are in block 4-7. Now, just because the blocks are logically next to each other it doesn't mean they are physically. For example, block 0-3 could be at the end of a cpu cache line, whilst block 3-7 is sitting in a page file. When the cpu goes to access block 3-7 in order to get the 3 bytes it needs it may see that the block isn't in memory and signals that it needs the memory paged in. This will probably block the calling process whilst the OS pages the memory back in.
After the memory has been paged in, but before your process is woken back up another one may come along and write a Y to address 4. Then your process is rescheduled and the CPU completes the read, but now it has read XYXX, rather than the XXXX you expected.
If you were asking why it's designed so, I would say it's a good side product from the design of CPU architecture.
Back in the 486 time, there is no multi-core CPU or QPI link, so atomicity isn't really a strict requirement at that time (DMA may require it?).
On x86, the data width is 32bits (or 64 bits for x86_64), meaning the CPU can read and write up to data width in one shot. And the memory data bus is typically the same or wider than this number. Combined with the fact that reading/writing on aligned address is done in one shot, naturally there is nothing preventing the read/write to be un-atomic. You gain speed/atomic at the same time.
To answer your first question, a variable is naturally aligned if it exists at a memory address that is a multiple of its size.
If we consider only - as the article you linked does - assignment instructions, then alignment guarantees atomicity because MOV (the assignment instruction) is atomic by design on aligned data.
Other kinds of instructions, INC for example, need to be LOCKed (an x86 prefix which gives exclusive access to the shared memory to the current processor for the duration of the prefixed operation) even if the data are aligned because they actually execute via multiple steps (=instructions, namely load, inc, store).

Memory ordering restrictions on x86 architecture

In his great book 'C++ Concurrency in Action' Anthony Williams writes the following (page 309):
For example, on x86 and x86-64 architectures, atomic load operations are
always the same, whether tagged memory_order_relaxed or memory_order_seq_cst
(see section 5.3.3). This means that code written using relaxed memory ordering may
work on systems with an x86 architecture, where it would fail on a system with a finer-
grained set of memory-ordering instructions such as SPARC.
Do I get this right that on x86 architecture all atomic load operations are memory_order_seq_cst? In addition, on the cppreference std::memory_order site is mentioned that on x86 release-aquire ordering is automatic.
If this restriction is valid, do the orderings still apply to compiler optimizations?
Yes, ordering still applies to compiler optimizations.
Also, it is not entirely exact that on x86 "atomic load operations are always the same".
On x86, all loads done with mov have acquire semantics and all stores done with mov have release semantics. So acq_rel, acq and relaxed loads are simple movs, and similarly acq_rel, rel and relaxed stores (acq stores and rel loads are always equal to relaxed).
This however is not necessarily true for seq_cst: the architecture does not guarantee seq_cst semantics for mov. In fact, the x86 instruction set does not have any specific instruction for sequentially consistent loads and stores. Only atomic read-modify-write operations on x86 will have seq_cst semantics. Hence, you could get seq_cst semantics for loads by doing a fetch_and_add operation (lock xadd instruction) with an argument of 0, and seq_cst semantics for stores by doing a seq_cst exchange operation (xchg instruction) and discarding the previous value.
But you do not need to do both! As long as all seq_cst stores are done with xchg, seq_cst loads can be implemented simply with a mov. Dually, if all loads were done with lock xadd, seq_cst stores could be implemented simply with a mov.
xchg and lock xadd are much slower than mov. Because a program has (usually) more loads than stores, it is convenient to do seq_cst stores with xchg so that the (more frequent) seq_cst loads can simply use a mov. This implementation detail is codified in the x86 Application Binary Interface (ABI). On x86, a compliant compiler must compile seq_cst stores to xchg so that seq_cst loads (which may appear in another translation unit, compiled with a different compiler) can be done with the faster mov instruction.
Thus it is not true in general that seq_cst and acquire loads are done with the same instruction on x86. It is only true because the ABI specifies that seq_cst stores be compiled to an xchg.
The compiler must of course follow the rules of the language, whatever hardware it runs on.
What he says is that on an x86 you don't have relaxed ordering, so you get a stricter ordering even if you don't ask for it. That also means that such code tested on an x86 might not work properly on a system that does have relaxed ordering.
It is worth keeping in mind that although a load relaxed and seq_cst load may map to the same instruction on x86, they are not the same. A load relaxed can be freely reordered by the compiler across memory operations to different memory locations while a seq_cst load cannot be reordered across other memory operations.
The sentence from the book is written in a somewhat misleading way. The ordering obtained on an architecture depends on not just how you translate atomic loads, but how you translate atomic stores.
The usual way to implement seq_cst on x86 is to flush the store buffer at some point between any seq_cst store and a subsequent seq_cst load from the same thread. The usual way for the compiler to guarantee this is to flush after stores, since there are fewer stores than loads. In this translation, seq_cst loads don't need to flush.
If you program x86 with just plain loads and stores, loads are guaranteed to provide acquire semantics, not seq_cst.
As for compiler optimization, in C11/C++11, the compiler does optimizations depending on code movement based on the semantics of the particular atomics, before considering the underlying hardware. (The hardware might provide stronger ordering, but there's no reason for the compiler to restrict its optimizations because of this.)
Do I get this right that on x86 architecture all atomic load
operations are memory_order_seq_cst?
Only executions (of a program, of some inter thread visible operations in a program) can be sequential. A single operation is not in itself sequential.
Asking whether the implementation of a single isolated operation is sequential is a meaningless question.
The translation of all memory operations that need some guarantee must be done following a strategy that enables that guarantee. There can be different strategies that have different compiler complexity costs and runtime costs.
[Just that there are different strategies to implement virtual functions: the only one that is OK (that fits all our expectations of speed, predictability and simplicity) is the use of vtables, so all compilers use vtable, but a virtual function is not defined as going through the vtable.]
In practice, there are not widely different strategies used to implement memory_order_seq_cst operations on a given CPU (that I know of). The differences between compilers are small and do not impede binary compatibility. But there are potentially differences and advanced global optimization of multi-threaded programs might open new opportunities for more efficient code generation for atomic operations.
Depending on your compiler, a program that contains only relaxed loads and memory_order_seq_cst modifications of std::atomic<> objects may or may not have exhibit only sequential behaviors, even on a strongly ordered CPU.