Determine if double word CAS is supported - c++

With the C++11 <atomic> library, you can take advantage of double-word CAS via the normal atomic<T>::compare_exchange_weak / compare_exchange_strong family of functions by simply using a 16-byte type for T. The implementation will apparently fall back on spin locks if the architecture doesn't actually support double-word CAS.
A common use for double-word CAS is the implementation of tagged pointers - pointers which carry an extra counter value in order to avoid the ABA problem that plagues many lockfree data-structures and algorithms.
However - a lot of professional implementations of lockfree algorithms (such as Boost.Lockfree) don't actually use double-word CAS - instead they rely on single-word CAS and opt to use non-portable pointer-stuffing shenanigans. Since x86_64 architectures only use the low 48 bits of a pointer, you can stuff a counter into the high 16 bits.
Of course, the pointer-stuffing technique only really works for x86_64 as far as I know - other 64-bit architectures don't necessarily guarantee that the high 16 bits will be available. Additionally, a 16-bit counter only gives you 65,536 distinct values, so theoretically, in some absurdly pathological case, the ABA problem could still happen if one thread is pre-empted and other threads somehow perform more than 65,536 operations before the original thread wakes up again (I know, I know, a very unlikely scenario - but still insanely MORE likely than if the counter variable was 64-bit...)
Because of the above considerations, it seems that IF double word CAS is supported on some architecture, tagged pointer implementations should prefer to use a 16-byte pointer struct containing the actual pointer and a counter variable, rather than 8-byte stuffed pointers.
But... is there any way to actually determine at compile time whether CMPXCHG16B is supported? I know most modern x86_64 platforms support it, but some older CPUs don't. Therefore, it seems the ideal thing to do would be to check if double word CAS is supported, and if so use 16-byte pointers. If not, fall back on 8-byte stuffed pointers.
So, is there any way to actually check (via some #ifdefing) if CMPXCHG16B is supported?
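(Editor's note: not part of the original question, but here is a minimal sketch of the 16-byte tagged-pointer approach together with a compile-time check. The TaggedPtr name is made up; the check relies on C++17's is_always_lock_free plus the __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 macro that GCC/Clang predefine when building with -mcx16. GCC may still report is_always_lock_free as false for 16-byte types even when cmpxchg16b is available, so treat this as a starting point, not a definitive test.)

#include <atomic>
#include <cstdint>

// Hypothetical 16-byte tagged pointer: a real pointer plus a full-width counter.
struct TaggedPtr {
    void* ptr;
    std::uint64_t tag;
};
static_assert(sizeof(TaggedPtr) == 16, "expected an 8-byte pointer + 8-byte tag");

// Compile-time hint that a lock-free 16-byte CAS is likely available.
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
inline constexpr bool has_dwcas = true;   // GCC/Clang built with -mcx16
#else
inline constexpr bool has_dwcas = std::atomic<TaggedPtr>::is_always_lock_free;
#endif

// Bump the tag with a full 16-byte CAS (falls back on a lock if not supported).
bool bump_tag(std::atomic<TaggedPtr>& a)
{
    TaggedPtr expected = a.load();
    TaggedPtr desired{ expected.ptr, expected.tag + 1 };
    return a.compare_exchange_strong(expected, desired);
}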

Related

Smaller type for atomic that implements a synchronization primitive, such as semaphore, mutex or barrier

Synchronization primitives can be implemented using std::atomic<T>, at least their userspace part. With C++20's std::atomic<T>::wait, they can be based on atomics alone.
The question is whether it is worth using pointer-sized types or smaller types.
C++20's std::counting_semaphore is an example where the maximum value is passed as a template parameter, so the choice can be made at compile time rather than hard-coded.
My thoughts so far:
I think that on a platform where the futex uses a specific variable size, you should just use that size. On Windows, with the more flexible WaitOnAddress, there's room to pick.
32-bit types on x64 lead to smaller encoding and the same efficiency, so they are probably the way to go. Reducing further to 16- and 8-bit types is apparently not useful (16-bit types result in larger encoding than 32-bit types).
There's also the case of Windows on ARM, which I'm not sure about.
For std::counting_semaphore, the ability to pick the integer type based on the max value is a side effect of specializing the binary semaphore, rather than an intention.
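(Editor's note: not from the question, but here is a rough sketch of how a compile-time maximum could drive the counter-type choice, in the spirit of std::counting_semaphore<LeastMaxValue>; all names below are made up for illustration.)

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Pick a storage type from the compile-time maximum: a byte for the binary
// case, otherwise a 32-bit counter (same speed as 64-bit on x64, smaller
// encoding), and 64 bits only if the max actually needs it.
template <std::ptrdiff_t LeastMaxValue>
using sem_counter_t = std::conditional_t<
    (LeastMaxValue <= 1), std::uint8_t,
    std::conditional_t<(LeastMaxValue <= INT32_MAX), std::uint32_t, std::uint64_t>>;

template <std::ptrdiff_t LeastMaxValue>
struct tiny_semaphore {
    std::atomic<sem_counter_t<LeastMaxValue>> count;
    // acquire()/release() would be built on fetch_sub/fetch_add plus
    // std::atomic::wait/notify_one (C++20) for the blocking part.
};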

Why is std::atomic<T>::is_lock_free() not static as well as constexpr?

Can anyone tell me why std::atomic<T>::is_lock_free() isn't static as well as constexpr? Having it non-static and/or non-constexpr doesn't make sense to me.
Why wasn't it designed like C++17's is_always_lock_free in the first place?
As explained on cppreference:
All atomic types except for std::atomic_flag may be implemented using mutexes or other locking operations, rather than using the lock-free atomic CPU instructions. Atomic types are also allowed to be sometimes lock-free, e.g. if only aligned memory accesses are naturally atomic on a given architecture, misaligned objects of the same type have to use locks.
The C++ standard recommends (but does not require) that lock-free atomic operations are also address-free, that is, suitable for communication between processes using shared memory.
As mentioned by multiple others, std::atomic<T>::is_always_lock_free might be what you are really looking for.
Edit: To clarify, C++ object types have an alignment value that restricts the addresses of their instances to only certain multiples of powers of two ([basic.align]). These alignment values are implementation-defined for fundamental types, and need not equal the size of the type. They can also be more strict than what the hardware could actually support.
For example, x86 (mostly) supports unaligned accesses. However, you will find most compilers having alignof(double) == sizeof(double) == 8 for x86, as unaligned accesses have a host of disadvantages (speed, caching, atomicity...). But e.g. #pragma pack(1) struct X { char a; double b; }; or alignas(1) double x; allows you to have "unaligned" doubles. So when cppreference talks about "aligned memory accesses", it presumably does so in terms of the natural alignment of the type for the hardware, not using a C++ type in a way that contradicts its alignment requirements (which would be UB).
Here is more information: What's the actual effect of successful unaligned accesses on x86?
Please also check out the insightful comments by @Peter Cordes below!
You may use std::atomic<T>::is_always_lock_free.
is_lock_free depends on the actual system and can't be determined at compile time.
Relevant explanation:
Atomic types are also allowed to be sometimes lock-free, e.g. if only aligned memory accesses are naturally atomic on a given architecture, misaligned objects of the same type have to use locks.
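(Editor's note: a tiny example, not from either answer, contrasting the two: is_always_lock_free is a static constexpr member since C++17 and can be used at compile time, while is_lock_free() is a member function answered at run time.)

#include <atomic>
#include <cstdint>

// Compile time: usable in static_assert / if constexpr.
static_assert(std::atomic<std::uint32_t>::is_always_lock_free,
              "expected lock-free 32-bit atomics on this target");

// Run time: asked of a concrete object, may depend on the actual system.
bool lock_free_u64()
{
    std::atomic<std::uint64_t> x{0};
    return x.is_lock_free();
}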
I've got Visual Studio 2019 installed on my Windows PC, and this devenv also includes an ARMv8 compiler. ARMv8 allows unaligned accesses, but compare-and-swaps, locked adds, etc. are mandated to be aligned. And a pure load / pure store using ldp or stp (load-pair or store-pair of 32-bit registers) is only guaranteed to be atomic when it's naturally aligned.
So I wrote a little program to check what is_lock_free() returns for an arbitrary atomic pointer. Here's the code:
#include <atomic>
#include <cstdint>   // for uint64_t
using namespace std;

bool isLockFreeAtomic( atomic<uint64_t> *a64 )
{
    return a64->is_lock_free();
}
And this is the disassembly of isLockFreeAtomic
|?isLockFreeAtomic@@YA_NPAU?$atomic@_K@std@@@Z| PROC
    movs    r0, #1
    bx      lr
    ENDP
This just returns true, aka 1.
This implementation chooses to use alignof( atomic<int64_t> ) == 8 so every atomic<int64_t> is correctly aligned. This avoids the need for runtime alignment checks on every load and store.
(Editor's note: this is common; most real-life C++ implementations work this way. This is why std::is_always_lock_free is so useful: because it's usually true for types where is_lock_free() is ever true.)
std::atomic<T>::is_lock_free() may in some implementations return true or false depending on runtime conditions.
As pointed out by Peter Cordes in the comments, the runtime condition is not alignment, as atomic will (over-)align its internal storage for efficient lock-free operations, and forcing misalignment is UB that may manifest as loss of atomicity.
It is possible to make an implementation that does not enforce alignment and does runtime dispatch based on alignment, but it is not what a sane implementation would do. It only makes sense for supporting pre-C++17, if __STDCPP_DEFAULT_NEW_ALIGNMENT__ is less than the required atomic alignment, since over-alignment for dynamic allocation does not work until C++17.
Another reason where runtime condition may determine atomicity is runtime CPU dispatch.
On x86-64, an implementation may detect the presence of cmpxchg16b via cpuid at initialization, and use it for 128-bit atomics; the same applies to cmpxchg8b and 64-bit atomics on 32-bit x86. If the corresponding cmpxchg is not available, lock-free atomics are unimplementable, and the implementation uses locks.
MSVC doesn't do runtime CPU dispatch currently. It doesn't do it for 64-bit atomics due to ABI compatibility reasons, and doesn't do it for 32-bit because it already doesn't support CPUs without cmpxchg8b. Boost.Atomic doesn't do this by default (it assumes cmpxchg8b and cmpxchg16b are present), but it can be configured to do the detection. I haven't bothered to look at what other implementations do yet.
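(Editor's note: for reference, this is roughly what such a one-time runtime check could look like on GCC/Clang; CPUID leaf 1 reports CMPXCHG16B in ECX bit 13. MSVC would use __cpuid from <intrin.h> instead. A sketch only, not taken from any particular implementation.)

#include <cpuid.h>   // GCC/Clang x86 intrinsic header

// Detect CMPXCHG16B at startup: CPUID.01H:ECX bit 13.
bool cpu_has_cmpxchg16b()
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                 // CPUID leaf 1 not supported
    return (ecx & (1u << 13)) != 0;   // bit 13 = CMPXCHG16B
}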

Atomic operations, std::atomic<> and ordering of writes

GCC compiles this:
#include <atomic>
std::atomic<int> a;
int b(0);
void func()
{
b = 2;
a = 1;
}
to this:
func():
mov DWORD PTR b[rip], 2
mov DWORD PTR a[rip], 1
mfence
ret
So, to clarify things for me:
Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2?
Why does the MFENCE happen after the write to ‘a’, not before?
Is the write to ‘a’ guaranteed to be an atomic (in the narrow, non-C++ sense) operation anyway, and does that apply to all Intel processors? I assume so from this output code.
Also, clang (v3.5.1 -O3) does this:
mov dword ptr [rip + b], 2
mov eax, 1
xchg dword ptr [rip + a], eax
ret
Which appears more straightforward to my little mind, but why the different approach, what’s the advantage of each?
I put your example on the Godbolt compiler explorer, and added some functions to read, increment, or combine (a+=b) two atomic variables. I also used a.store(1, memory_order_release); instead of a = 1; to avoid getting more ordering than needed, so it's just a simple store on x86.
See below for (hopefully correct) explanations. update: I had "release" semantics confused with just a StoreStore barrier. I think I fixed all the mistakes, but may have left some.
The easy question first:
Is the write to ‘a’ guaranteed to be atomic?
Yes, any thread reading a will get either the old or the new value, not some half-written value. This happens for free on x86 and most other architectures with any aligned type that fits in a register. (e.g. not int64_t on 32bit.) Thus, on many systems, this happens to be true for b as well, the way most compilers would generate code.
There are some types of stores that may not be atomic on an x86, including unaligned stores that cross a cache line boundary. But std::atomic of course guarantees whatever alignment is necessary.
Read-modify-write operations are where this gets interesting. 1000 evaluations of a+=3 done in multiple threads at once will always increment a by exactly 3000. You'd potentially get fewer if a wasn't atomic.
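(Editor's note: a quick self-contained demo of that claim, with made-up thread and iteration counts; the atomic += is a single read-modify-write, so no increments are lost.)

#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> counter{0};

void demo_atomic_rmw()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 250; ++i)
                counter += 3;          // equivalent to counter.fetch_add(3)
        });
    for (auto& th : threads) th.join();
    // counter.load() == 3000 every time; a plain int could come up short.
}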
Fun fact: signed atomic types guarantee two's complement wraparound, unlike normal signed types. C and C++ still cling to the idea of leaving signed integer overflow undefined in other cases. Some CPUs don't have arithmetic right shift, so leaving right-shift of negative numbers undefined makes some sense, but otherwise it just feels like a ridiculous hoop to jump through now that all CPUs use 2's complement and 8bit bytes. </rant>
Is any other thread reading ‘a’ as 1 guaranteed to read ‘b’ as 2?
Yes, because of the guarantees provided by std::atomic.
Now we're getting into the memory model of the language, and the hardware it runs on.
C11 and C++11 have a very weak memory ordering model, which means the compiler is allowed to reorder memory operations unless you tell it not to. (source: Jeff Preshing's Weak vs. Strong Memory Models). Even if x86 is your target machine, you have to stop the compiler from re-ordering stores at compile time. (e.g. normally you'd want the compiler to hoist a = 1 out of a loop that also writes to b.)
Using C++11 atomic types gives you full sequential-consistency ordering of operations on them with respect to the rest of the program, by default. This means they're a lot more than just atomic. See below for relaxing the ordering to just what's needed, which avoids expensive fence operations.
Why does the MFENCE happen after the write to ‘a’, not before?
StoreStore fences are a no-op with x86's strong memory model, so the compiler just has to put the store to b before the store to a to implement the source code ordering.
Full sequential consistency also requires that the store be globally ordered / globally visible before any later loads in program order.
x86 can reorder stores after later loads, i.e. a later load can become globally visible before an earlier store. In practice, what happens is that out-of-order execution sees an independent load in the instruction stream, and executes it ahead of a store that was still waiting on its data to be ready. Anyway, sequential consistency forbids this, so gcc uses MFENCE, which is a full barrier, including StoreLoad (the only kind x86 doesn't have for free; LFENCE/SFENCE are only useful for weakly-ordered operations like movnt).
Another way to put this is the way the C++ docs use: sequential consistency guarantees that all threads see all changes in the same order. The MFENCE after every atomic store guarantees that this thread's later loads wait until the store is globally visible. Otherwise, our loads could see our own stores before other threads' loads saw our stores. A StoreLoad barrier (MFENCE) delays our loads until after the stores that need to happen first.
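(Editor's note: to make the cost concrete, here is the same store with the two orderings side by side; the asm comments describe typical x86-64 codegen from mainstream compilers, not a guarantee.)

#include <atomic>

std::atomic<int> flag;

void store_seq_cst()
{
    flag.store(1, std::memory_order_seq_cst);
    // typically: mov + mfence (gcc) or xchg (clang, newer gcc); pays for the StoreLoad barrier
}

void store_release()
{
    flag.store(1, std::memory_order_release);
    // typically: a plain mov; x86 stores already have release semantics
}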
The ARM32 asm for b=2; a=1; is:
# get pointers and constants into registers
str r1, [r3] # store b=2
dmb sy # Data Memory Barrier: full memory barrier to order the stores.
# I think just a StoreStore barrier here (dmb st) would be sufficient, but gcc doesn't do that. Maybe later versions have that optimization, or maybe I'm wrong.
str r2, [r3, #4] # store a=1 (a is 4 bytes after b)
dmb sy # full memory barrier to order this store wrt. all following loads and stores.
I don't know ARM asm, but what I've figured out so far is that normally it's op dest, src1 [,src2], but loads and stores always have the register operand first and the memory operand 2nd. This is really weird if you're used to x86, where a memory operand can be the source or dest for most non-vector instructions. Loading immediate constants also takes a lot of instructions, because the fixed instruction length only leaves room for 16b of payload for movw (move word) / movt (move top).
Release / Acquire
The release and acquire naming for one-way memory barriers comes from locks:
One thread modifies a shared data structure, then releases a lock. The unlock has to be globally visible after all the loads/stores to data it's protecting. (StoreStore + LoadStore)
Another thread acquires the lock (read, or RMW with a release-store), and must do all loads/stores to the shared data structure after the acquire becomes globally visible. (LoadLoad + LoadStore)
Note that std::atomic uses these names even for standalone fences, which are slightly different from load-acquire or store-release operations. (See atomic_thread_fence, below.)
Release/Acquire semantics are stronger than what producer-consumer requires. That just requires one-way StoreStore (producer) and one-way LoadLoad (consumer), without LoadStore ordering.
A shared hash table protected by a readers/writers lock (for example) requires an acquire-load / release-store atomic read-modify-write operation to acquire the lock. x86 lock xadd is a full barrier (including StoreLoad), but ARM64 has load-acquire/store-release version of load-linked/store-conditional for doing atomic read-modify-writes. As I understand it, this avoids the need for a StoreLoad barrier even for locking.
Using weaker but still sufficient ordering
Writes to std::atomic types are ordered with respect to every other memory access in source code (both loads and stores), by default. You can control what ordering is imposed with std::memory_order.
In your case, you only need your producer to make sure stores become globally visible in the correct order, i.e. a StoreStore barrier before the store to a. store(memory_order_release) includes this and more. std::atomic_thread_fence(memory_order_release) is just a 1-way StoreStore barrier for all stores. x86 does StoreStore for free, so all the compiler has to do is put the stores in source order.
Release instead of seq_cst will be a big performance win, esp. on architectures like x86 where release is cheap/free. This is even more true if the no-contention case is common.
Reading atomic variables also imposes full sequential consistency of the load with respect to all other loads and stores. On x86, this is free. LoadLoad and LoadStore barriers are no-ops and implicit in every memory op. You can make your code more efficient on weakly-ordered ISAs by using a.load(std::memory_order_acquire).
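(Editor's note: pulling the last few paragraphs together, this is the minimal release/acquire version of the question's producer and a matching consumer, reusing the same a and b names; on x86 neither side needs any barrier instructions.)

#include <atomic>

int b = 0;                   // non-atomic payload
std::atomic<int> a{0};       // flag

void producer()
{
    b = 2;                                   // payload first
    a.store(1, std::memory_order_release);   // publish: b is visible before a == 1
}

int consumer()
{
    while (a.load(std::memory_order_acquire) != 1) { }  // wait for the flag
    return b;                                           // guaranteed to read 2
}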
Note that the std::atomic standalone fence functions confusingly reuse the "acquire" and "release" names for StoreStore and LoadLoad fences that order all stores (or all loads) in at least the desired direction. In practice, they will usually emit HW instructions that are 2-way StoreStore or LoadLoad barriers. This doc is the proposal for what became the current standard. You can see how memory_order_release maps to a #LoadStore | #StoreStore on SPARC RMO, which I assume was included partly because it has all the barrier types separately. (hmm, the cppref web page only mentions ordering stores, not the LoadStore component. It's not the C++ standard, though, so maybe the full standard says more.)
memory_order_consume isn't strong enough for this use-case. This post talks about your case of using a flag to indicate that other data is ready, and talks about memory_order_consume.
consume would be enough if your flag was a pointer to b, or even a pointer to a struct or array. However, no compiler knows how to do the dependency tracking to make sure it puts things in the proper order in the asm, so current implementations always treat consume as acquire. This is too bad, because every architecture except DEC Alpha (and C++11's software model) provides this ordering for free. According to Linus Torvalds, only a few Alpha hardware implementations actually could have this kind of reordering, so the expensive barrier instructions needed all over the place were pure downside for most Alphas.
The producer still needs to use release semantics (a StoreStore barrier), to make sure the new payload is visible when the pointer is updated.
It's not a bad idea to write code using consume, if you're sure you understand the implications and don't depend on anything that consume doesn't guarantee. In the future, once compilers are smarter, your code will compile without barrier instructions even on ARM/PPC. The actual data movement still has to happen between caches on different CPUs, but on weak memory model machines, you can avoid waiting for any unrelated writes to be visible (e.g. scratch buffers in the producer).
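(Editor's note: for completeness, this is what the pointer-based variant described above might look like; current compilers will treat the consume load as acquire anyway, so this sketch only illustrates the intent.)

#include <atomic>

struct Payload { int data; };
std::atomic<Payload*> published{nullptr};

void producer(Payload* p)
{
    p->data = 42;                                   // fill in the payload
    published.store(p, std::memory_order_release);  // producer still needs release
}

int consumer()
{
    Payload* p;
    while ((p = published.load(std::memory_order_consume)) == nullptr) { }
    return p->data;   // reached through the published pointer: a data dependency
}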
Just keep in mind that you can't actually test memory_order_consume code experimentally, because current compilers are giving you stronger ordering than the code requests.
It's really hard to test any of this experimentally anyway, because it's timing-sensitive. Also, unless the compiler re-orders operations (because you failed to tell it not to), producer-consumer threads will never have a problem on x86. You'd need to test on an ARM or PowerPC or something to even try to look for ordering problems happening in practice.
references:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458: I reported the gcc bug I found with b=2; a.store(1, MO_release); b=3; producing a=1;b=3 on x86, rather than b=3; a=1;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461: I also reported the fact that ARM gcc uses two dmb sy in a row for a=1; a=1;, and x86 gcc could maybe do with fewer mfence operations. I'm not sure if an mfence between each store is needed to protect a signal handler from making wrong assumptions, or if it's just a missing optimization.
The Purpose of memory_order_consume in C++11 (already linked above) covers exactly this case of using a flag to pass a non-atomic payload between threads.
What StoreLoad barriers (x86 mfence) are for: a working sample program that demonstrates the need: http://preshing.com/20120515/memory-reordering-caught-in-the-act/
Data-dependency barriers (only Alpha needs explicit barriers of this type, but C++ potentially needs them to prevent the compiler doing speculative loads): http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#360
Control-dependency barriers: http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt#592
Doug Lea says x86 only needs LFENCE for data that was written with "streaming" writes like movntdqa or movnti. (NT = non-temporal). Besides bypassing the cache, x86 NT loads/stores have weakly-ordered semantics.
http://preshing.com/20120913/acquire-and-release-semantics/
http://preshing.com/20120612/an-introduction-to-lock-free-programming/ (pointers to books and other stuff he recommends).
Interesting thread on realworldtech about whether barriers everywhere or strong memory models are better, including the point that data-dependency is nearly free in HW, so it's dumb to skip it and put a large burden on software. (The thing Alpha (and C++) doesn't have, but everything else does). Go back a few posts from that to see Linus Torvalds' amusing insults, before he got around to explaining more detailed / technical reasons for his arguments.

Does int32_t have lower latency than int8_t, int16_t and int64_t?

(I'm referring to Intel CPUs, mainly with GCC, but possibly ICC or MSVC)
Is it true that using int8_t, int16_t or int64_t is less efficient compared with int32_t, due to additional instructions generated to convert between the CPU word size and the chosen variable size?
I would be interested if anybody has any examples or best practices for this. I sometimes use smaller variable sizes to reduce cache-line loads, but say I only consumed 50 bytes of a cache line, with one variable being an 8-bit int; might it be quicker to use the remaining cache-line space and promote the 8-bit int to a 32-bit int, etc.?
You can stuff more uint8_ts into a cache line, so loading N uint8_ts will be faster than loading N uint32_ts.
In addition, if you are using a modern Intel chip with SIMD instructions, a smart compiler will vectorize what it can. Again, using a small variable in your code will allow the compiler to stuff more lanes into a SIMD register.
I think it is best to use the smallest size you can, and leave the details up to the compiler. The compiler is probably smarter than you (and me) when it comes to stuff like this. For many operations (say unsigned addition), the compiler can use the same code for uint8, uint16 or uint32 (and just ignore the upper bits), so there is no speed difference.
The bottom line is that a cache miss is WAY more expensive than any arithmetic or logical operation, so it is nearly always better to worry about cache (and thus data size) than simple arithmetic.
(It used to be true, a long time ago, that on Sun workstations using double was significantly faster than float, because the hardware only supported double. I don't think that is true any more for modern x86, as the SIMD hardware (SSE, etc.) has direct support for both single and double precision.)
Mark Lakata's answer points in the right direction.
I would like to add some points.
A wonderful resource for understanding and making optimization decisions are the Agner Fog documents.
The Instruction Tables document has the latencies for the most common instructions. You can see that some of them perform better in the native-size version.
A mov, for example, may be eliminated; a mul may have less latency.
However, here we are talking about gaining one clock cycle; we would have to execute a lot of instructions to compensate for a cache miss.
If this were the whole story, it would not have been worth it.
The real problem comes with the decoders.
When you use length-changing prefixes (and you will, by using a non-native word size), the decoders take extra cycles.
The operand size prefix therefore changes the length of the rest of the instruction. The predecoders are unable to resolve this problem in a single clock cycle. It takes 6 clock cycles to recover from this error. It is therefore very important to avoid such length-changing prefixes.
In older (but still present) microarchitectures the penalty was severe, especially with some kinds of arithmetic instructions.
In later microarchitectures this has been mitigated, but the penalty is still present.
Another aspect to consider is that using a non-native size requires prefixing the instructions, thereby generating larger code.
This is as close as possible to the statement "additional instructions [are] generated to convert between the CPU word size and the chosen variable size", since Intel CPUs can handle non-native word sizes.
With other CPUs, especially RISC ones, this is not generally true, and more instructions can be generated.
So while you are making optimal use of the data cache, you are also making bad use of the instruction cache.
It is also worth noting that on the common x64 ABI the stack must be aligned on a 16-byte boundary, and the compiler usually saves local variables in the native word size or a close one (e.g. a DWORD on a 64-bit system).
Only if you are allocating a sufficient number of local variables, or if you are using arrays or packed structs, can you gain benefits from using a small variable size.
If you declare a single uint16_t variable, it will probably take the same stack space as a single uint64_t, so it is best to go for the fastest size.
Furthermore, when it comes to the data cache, it is locality that matters, rather than data size alone.
So, what to do?
Luckily, you don't have to decide between having small data or small code.
If you have a considerable quantity of data, this is usually handled with arrays or pointers and by the use of intermediate variables. An example is this line of code:
t = my_big_data[i];
Here my approach is:
Keep the external representation of data, i.e. the my_big_data array, as small as possible. For example, if that array stores temperatures, use a coded uint8_t for each element.
Keep the internal representation of data, i.e. the t variable, as close as possible to the CPU word size. For example, t could be a uint32_t or uint64_t.
This way your program optimizes both caches and uses the native word size.
As a bonus, you may later decide to switch to SIMD instructions without having to repack the my_big_data memory layout.
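(Editor's note: a small sketch of that external-vs-internal split, with made-up names; the array stays one byte per element while the hot loop works in a register-width temporary.)

#include <cstddef>
#include <cstdint>

// my_big_data-style storage: compact uint8_t elements, native-width temporaries.
std::uint32_t sum_temperatures(const std::uint8_t* temps, std::size_t count)
{
    std::uint32_t total = 0;             // internal representation: CPU-friendly width
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t t = temps[i];      // widen once on load
        total += t;
    }
    return total;
}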
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
D. Knuth
When you design your structures' memory layout, be problem driven. For example, age values need 8 bits, city distances in miles need 16 bits.
When you code the algorithms, use the fastest type the compiler is known to have for that scope. For example, integers are faster than floating-point numbers, and uint_fast8_t is no slower than uint8_t.
When it is then time to improve performance, start by changing the algorithm (by using faster types, eliminating redundant operations, and so on), and then, if needed, the data structures (by aligning, padding, packing and so on).

std::atomic treat a pair of atomic int32 as one atomic int64?

I have a pair of unsigned int32
std::atomic<u32> _start;
std::atomic<u32> _end;
Sometimes I want to set start or end with compare exchange, so I don't want spurious failures that could be caused by using CAS on the entire 64bit pair. I just want to use 32 bit CAS.
_end.compare_exchange_strong(old_end, new_end);
Now I could fetch both start and end as one atomic 64-bit read, or as two separate 32-bit reads. Wouldn't it be faster to do one 64-bit atomic fetch (with the compiler adding the appropriate memory fence) rather than two separate 32-bit atomic reads with two memory fences (or would the compiler optimize that away)?
If so, how would I do that in c++11?
The standard doesn't guarantee that std::atomics have the same size as the underlying type, nor that the operations on atomic are lockfree (although they are likely to be for uint32 at least). Therefore I'm pretty sure there isn't any conforming way to combine them into one 64bit atomic operation. So you need to decide whether you want to manually combine the two variables into a 64bit one (and only work with 64bit operations) or not.
As an example, the platform might not support 64-bit CAS (for x86 that was added with the first Pentium IIRC, so it would not be available when compiling 486-compatible code). In that case it needs to lock somehow, so the atomic might contain both the 64-bit variable and the lock.
Concerning the fences: well, that depends on the memory_order you specify for your operation. If the memory order specifies that the two operations need to be visible in the order they are executed in, the compiler will obviously not be able to optimize a fence away; otherwise it might. Of course, assuming you target x86, only memory_order_seq_cst will actually emit a barrier instruction from what I remember, so anything less would merely constrain instruction reordering done by the compiler, but wouldn't have an actual runtime penalty.
Of course, depending on your platform, you might get away with treating two std::atomic<int32> as one int64, doing the casting either via a union or reinterpret_cast; just be advised that this behaviour is not required by the standard and can (at least theoretically) stop working at any time (new compiler version, different optimization settings, ...)
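(Editor's note: here is one way the "manually combine the two variables into a 64-bit one" option from this answer could look; the pack/unpack helpers and names are invented for the sketch.)

#include <atomic>
#include <cstdint>

using u32 = std::uint32_t;
using u64 = std::uint64_t;

std::atomic<u64> start_end{0};   // low half = start, high half = end

constexpr u64 pack(u32 start, u32 end) { return (u64(end) << 32) | start; }
constexpr u32 unpack_start(u64 v) { return u32(v); }
constexpr u32 unpack_end(u64 v)   { return u32(v >> 32); }

// CAS only the `end` half, retrying if anything changed underneath us.
void set_end(u32 new_end)
{
    u64 old = start_end.load();
    while (!start_end.compare_exchange_weak(old, pack(unpack_start(old), new_end))) {
        // `old` is reloaded by compare_exchange_weak on failure; just retry
    }
}

// One 64-bit load gives a consistent snapshot of both halves.
void snapshot(u32& start, u32& end)
{
    u64 v = start_end.load();
    start = unpack_start(v);
    end   = unpack_end(v);
}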
If your two ints require atomic updates, then you must treat them as a single atomic 64-bit value; you really have no other choice. Separate integer updates are not atomic and not viable. I concur that unions are not relevant here, and suggest that instead you simply cast the pair of integers as (INT64) and perform your Cas64.
Using a critical section is overkill - use the Cas64; they only cost about 33 machine cycles (unopposed), while critical sections cost more like 100 cycles unopposed.
Note that it is commonplace to perform this exact same operation on versioned pointers, which in 32-bit code consist of a 32-bit pointer and a 32-bit version counter, updated together as one using Cas64 as described. Also, they actually do have to "line up right", because you never want such values to overhang a cache-line boundary.