Can std::atomic treat a pair of atomic int32 as one atomic int64? - c++

I have a pair of unsigned int32 values:
std::atomic<u32> _start;
std::atomic<u32> _end;
Sometimes I want to set start or end with compare-exchange, so I don't want spurious failures that could be caused by using CAS on the entire 64-bit pair. I just want to use 32-bit CAS.
_end.compare_exchange_strong(old_end, new_end);
Now I could fetch both start and end as one atomic 64-bit read, or as two separate 32-bit reads. Wouldn't it be faster to do one 64-bit atomic fetch (with the compiler adding the appropriate memory fence) rather than two separate 32-bit atomic reads with two memory fences (or would the compiler optimize that away)?
If so, how would I do that in C++11?

The standard doesn't guarantee that std::atomic has the same size as the underlying type, nor that the operations on an atomic are lock-free (although they are likely to be for uint32 at least). Therefore I'm pretty sure there isn't any conforming way to combine them into one 64-bit atomic operation. So you need to decide whether you want to manually combine the two variables into a 64-bit one (and work exclusively with 64-bit operations) or not.
As an example, the platform might not support 64-bit CAS (for x86 it was added with the first Pentium, IIRC, so it would not be available when compiling for 486 compatibility). In that case the implementation needs to lock somehow, so the atomic might contain both the 64-bit variable and the lock.
Concerning the fences: that depends on the memory_order you specify for your operations. If the memory order requires the two operations to become visible in the order they are executed, the compiler obviously can't optimize a fence away; otherwise it might. Of course, assuming you target x86, only memory_order_seq_cst will actually emit a barrier instruction, from what I remember; anything weaker would only restrict instruction reordering by the compiler, without an actual runtime penalty.
Of course, depending on your platform you might get away with treating two std::atomic<int32> as one int64, doing the casting either via a union or reinterpret_cast. Just be advised that this behaviour is not required by the standard and can (at least theoretically) stop working at any time (new compiler version, different optimization settings, ...).
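For illustration, a minimal sketch of that cast trick, with the caveats made explicit: it assumes std::atomic<uint32_t> is exactly the size of uint32_t with no extra state, that the pair is contiguous and 8-byte aligned, and that 64-bit loads are lock-free on the target. None of this is guaranteed by the standard.

#include <atomic>
#include <cstdint>

struct alignas(8) Range {
    std::atomic<std::uint32_t> _start{0};
    std::atomic<std::uint32_t> _end{0};

    // 32-bit CAS on one half, exactly as in the question.
    bool try_set_end(std::uint32_t old_end, std::uint32_t new_end) {
        return _end.compare_exchange_strong(old_end, new_end);
    }

    // One 64-bit atomic read of both halves. Undefined behavior per the
    // standard, but it often "works" on x86-64 with common compilers.
    std::uint64_t load_both() {
        auto* both = reinterpret_cast<std::atomic<std::uint64_t>*>(this);
        return both->load(std::memory_order_acquire);
    }
};

static_assert(sizeof(Range) == sizeof(std::uint64_t),
              "layout assumption does not hold on this platform");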

If your two ints require atomic updates together, then you must treat them as a single atomic 64-bit value; you really have no other choice. Separate integer updates are not atomic and not viable. I concur that unions are not relevant here, and suggest that instead you simply cast the pair of integers to an INT64 and perform your Cas64.
Using a critical section is overkill; use the Cas64 instead. It costs only about 33 machine cycles (unopposed), while a critical section costs more like 100 cycles unopposed.
Note that it is commonplace to perform this exact same operation on versioned pointers, which in 32-bit code consist of a 32-bit pointer and a 32-bit version counter, updated together as one using Cas64 as described. Also, they actually do have to "line up right", because you never want such values to straddle a cache-line boundary.
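As a hedged sketch of what that can look like in C++ (the names VersionedPtr and try_update are made up for illustration, and a 32-bit pointer target is assumed):

#include <atomic>
#include <cstdint>

struct alignas(8) VersionedPtr {
    std::atomic<std::uint64_t> packed{0};

    // Pack a 32-bit pointer value and a 32-bit version into one word.
    static std::uint64_t pack(std::uint32_t ptr, std::uint32_t version) {
        return (static_cast<std::uint64_t>(version) << 32) | ptr;
    }

    // Replace the pointer only if both halves still match, bumping the
    // version so an A-B-A sequence is still detected.
    bool try_update(std::uint32_t expected_ptr, std::uint32_t expected_ver,
                    std::uint32_t new_ptr) {
        std::uint64_t expected = pack(expected_ptr, expected_ver);
        std::uint64_t desired  = pack(new_ptr, expected_ver + 1);
        return packed.compare_exchange_strong(expected, desired);
    }
};

The alignas(8) keeps the whole value on one cache line, matching the "line up right" requirement above.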

Related

Compare and Swap: synchronizing via different data sizes

With the GCC builtin C atomic primitives, we can perform an atomic CAS operation using __atomic_compare_exchange.
Unlike C++11's std::atomic type, the GCC C atomic primitives operate on regular non-atomic integral types, including 128-bit integers on platforms where cmpxchg16b is supported. (A future version of the C++ standard may support similar functionality with the std::atomic_view class template.)
This makes me question:
What happens if an atomic CAS operation on a larger data size observes a change which happened by an atomic operation on the same memory location, but using a smaller data size?
For example, suppose we have:
struct uint128_type {
    uint64_t x;
    uint64_t y;
} __attribute__ ((aligned (16)));
And suppose we have a shared variable of type uint128_type, like:
uint128_type Foo;
Now, suppose Thread A does:
uint128_type expected = { 0, 0 };
uint128_type desired = { 100, 100 };
bool result = __atomic_compare_exchange(
    &Foo,
    &expected,
    &desired,
    0,                // strong CAS, not weak
    __ATOMIC_SEQ_CST, // memory order on success
    __ATOMIC_SEQ_CST  // memory order on failure
);
And Thread B does:
uint64_t expected = 0;
uint64_t desired = 500;
bool result = __atomic_compare_exchange(
    &Foo.x,
    &expected,
    &desired,
    0,                // strong CAS, not weak
    __ATOMIC_SEQ_CST, // memory order on success
    __ATOMIC_SEQ_CST  // memory order on failure
);
What happens if Thread A's 16-byte CAS happens before Thread B's 8-byte CAS (or vice versa)? Does the CAS fail as normal? Is this even defined behavior? Is this likely to "do the right thing" on typical architectures like x86_64 that support 16-byte CAS?
Edit: to be clear, since it seems to be causing confusion, I'm not asking whether the above behavior is defined by the C++ standard. Obviously, all the __atomic_* functions are GCC extensions. (However, future C++ standards may have to define this sort of thing if std::atomic_view becomes standardized.) I am asking more generally about the semantics of atomic operations on typical modern hardware. As an example, if x86_64 code has two threads perform atomic operations on the same memory address, but one thread uses CMPXCHG8B and the other uses CMPXCHG16B, so that one does an atomic CAS on a single word while the other does an atomic CAS on a double word, how are the semantics of these operations defined? More specifically, would the CMPXCHG16B fail because it observes that the data has mutated from the expected value due to a previous CMPXCHG8B?
In other words, can two different CAS operations using two different data sizes (but the same starting memory address) safely be used to synchronize between threads?
One or the other happens first, and each operation proceeds according to its own semantics.
On x86 CPUs, both operations will require a lock on the same cache line held throughout the entire operation. So whichever one gets that lock first will not see any effects of the second operation and whichever one gets that lock second will see all the effects of the first operation. The semantics of both operations will be fully respected.
Other hardware may achieve this result some other way, but if it doesn't achieve this result, it's broken unless it specifies that it has a limitation.
The atomic data will eventually be located somewhere in memory, and all accesses to it (or to the respective caches, when the operations are atomic) will be serialized. Since a CAS operation is supposed to be atomic, it will be performed as a whole or not at all.
That being said, one of the operations will succeed and the second will fail; the order is non-deterministic.
From x86 Instruction Set Reference:
This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)
Clearly, both threads will attempt a locked write after a locked read (when used with the LOCK prefix, that is), which means only one of them will succeed in performing the CAS; the other will read the already-changed value.
Hardware is usually very conservative when checking for conflicts between potentially conflicting atomic operations. It may even happen that two CAS operations on two completely different, non-overlapping address ranges are detected as conflicting with each other.
The definition of atomic is unlikely to change (emphasis mine):
In concurrent programming, an operation (or set of operations) is atomic,
linearizable, indivisible or uninterruptible if it appears to
the rest of the system to occur instantaneously. Atomicity is a
guarantee of isolation from concurrent processes.
Your question asks...
What happens if an atomic CAS operation on a larger data size observes
a change which happened by an atomic operation on the same memory
location, but using a smaller data size?
By definition, no two overlapping regions of memory modified using atomic operations can mutate concurrently, i.e. the two operations must happen linearly or they're not atomic.

Detect at runtime if a load is atomic?

My application requires a couple of atomic loads and stores. Unfortunately, these operations must occur at a particular address in a memory-mapped file, so I cannot use C++11's std::atomic (since std::atomic works by controlling the size and alignment/placement of the variable). Since I control the memory-mapped file format and we only run on a single CPU architecture, I simply looked up the alignment and size constraints of my target and arranged things to allow for atomicity (including a full fence at just the right spot).
Is there a way to test (at runtime) if a read or a write of a particular size to a particular address would be atomic? My primary platform is x86-64 but I'd also be interested in solutions for ARM.
Short answer: Probably not. You could write some code that updates and checks validity of your values, and run that for 6 months. However, it will almost certainly not GUARANTEE that the code is correct - just that you haven't hit the right spot to make it go wrong.
Long answer: Loads and stores of processor words are almost certainly atomic in and of themselves, however the std::atomic functionality provides a stronger guarantee than that. It guarantees that no processor will use a value that is "stale" (cache-coherency and exclusive updates). You can't make the same guarantee without std::atomic (unless the platform itself guarantees this, which would probably require it to be a single core processor at the very least).
In the generic case, loads and stores as performed by the processor have a "weak" cache-coherency and atomic-update policy. Say we have this code:
int a = 42;
bool flag = false;
...
a++;
flag = true;
and some other code that does:
while (!flag)
    ;   // spin until flag is set
a--;
[I'm currently ignoring the fact that flag also needs an atomic update policy and probably needs to be volatile - that is not the point in this case.]
There is no guarantee that the compiler doesn't turn a++ into tmp = a; tmp = tmp + 1; a = tmp; (and correspondingly for the a--), possibly with extra instructions thrown in between for good measure, because the compiler expects that to be faster/better in some other way.
There is also no guarantee that, even after the value has been updated to 43, the other code won't read the old value 42 after it exits the loop on flag being true (because the processor is not doing everything in exactly the order you expect, and the cache content is updated "out of order").
x86 processors are definitely among those that give no guarantee that the updated value will be immediately visible as described above. Cache coherency and atomicity only guarantee that you won't read some "half-baked" value - it's either the "old" or the "new" value when a value is updated - but it may remain the "old" value for quite some time after the "new" value has been written, which is typically "not a good thing".
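For contrast, here is a sketch of the same flag pattern written with std::atomic, which supplies the ordering guarantees the plain version above lacks (the function names are just illustrative):

#include <atomic>

std::atomic<int>  a{42};
std::atomic<bool> flag{false};

// Writer thread: the release store to flag guarantees that the earlier
// increment of a is visible to any thread that observes flag == true.
void writer() {
    a.fetch_add(1, std::memory_order_relaxed);
    flag.store(true, std::memory_order_release);
}

// Reader thread: the acquire load pairs with the release store above,
// so once the loop exits, a is guaranteed to hold the new value (43).
void reader() {
    while (!flag.load(std::memory_order_acquire))
        ;   // spin
    a.fetch_sub(1, std::memory_order_relaxed);
}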

Determine if double word CAS is supported

With the C++11 <atomic> library, you can take advantage of double-word CAS via the normal atomic<T>::compare_exchange_weak/strong family of functions by simply using a 16-byte type for T. The compiler will fall back on spin locks, apparently, if the architecture doesn't actually support double-word CAS.
A common use for double-word CAS is the implementation of tagged pointers - pointers which carry an extra counter value in order to avoid the ABA problem that plagues many lockfree data-structures and algorithms.
However, a lot of professional implementations of lockfree algorithms (such as Boost.Lockfree) don't actually use double-word CAS - instead they rely on single-word CAS and opt for non-portable pointer-stuffing shenanigans. Since x86_64 architectures only use the first 48 bits of a pointer, you can use the high 16 bits to stuff in a counter.
Of course, the pointer-stuffing technique only really works on x86_64 as far as I know - other 64-bit architectures don't necessarily guarantee that the high 16 bits will be available. Additionally, the 16-bit counter only gives you a maximum of 65,536 values, so theoretically, in some absurdly pathological case, the ABA problem could still happen: one thread is pre-empted, and other threads somehow perform more than 65,536 operations before the original thread wakes up again (I know, I know, a very unlikely scenario - but still insanely MORE likely than if the counter variable were 64-bit...). A rough sketch of the stuffing technique follows.
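This sketch assumes x86_64 user-space (canonical, non-sign-extended) addresses; StuffedPtr and its helpers are hypothetical names:

#include <atomic>
#include <cstdint>

struct StuffedPtr {
    std::atomic<std::uint64_t> value{0};

    static std::uint64_t pack(void* p, std::uint16_t tag) {
        // Keep the low 48 address bits, carry the tag in the high 16.
        return (static_cast<std::uint64_t>(tag) << 48) |
               (reinterpret_cast<std::uint64_t>(p) & ((1ULL << 48) - 1));
    }
    static void* ptr(std::uint64_t v) {
        return reinterpret_cast<void*>(v & ((1ULL << 48) - 1));
    }
    static std::uint16_t tag(std::uint64_t v) {
        return static_cast<std::uint16_t>(v >> 48);
    }

    bool try_swap(std::uint64_t expected, void* new_ptr) {
        // Bump the 16-bit tag on every successful exchange; it wraps
        // after 65,536 operations, which is the weakness noted above.
        std::uint16_t next_tag = static_cast<std::uint16_t>(tag(expected) + 1);
        std::uint64_t desired  = pack(new_ptr, next_tag);
        return value.compare_exchange_weak(expected, desired);
    }
};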
Because of the above considerations, it seems that IF double-word CAS is supported on some architecture, tagged-pointer implementations should prefer a 16-byte struct containing the actual pointer and a counter variable, rather than 8-byte stuffed pointers. For example:
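Here is a sketch of such a 16-byte tagged pointer (TaggedPtr and try_replace are illustrative names; whether the atomic is actually lock-free depends on the target and compiler flags):

#include <atomic>
#include <cstdint>

struct alignas(16) TaggedPtr {
    void*         ptr;
    std::uint64_t count;   // full 64-bit counter: wrap-around is a non-issue
};

std::atomic<TaggedPtr> head{TaggedPtr{nullptr, 0}};

bool try_replace(void* expected_ptr, std::uint64_t expected_count,
                 void* new_ptr) {
    TaggedPtr expected{expected_ptr, expected_count};
    TaggedPtr desired{new_ptr, expected_count + 1};  // bump the counter
    return head.compare_exchange_weak(expected, desired);
}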
But... is there any way to actually determine at compile time whether CMPXCHG16B is supported? I know most modern x86_64 platforms support it, but some older computers don't. Therefore, it seems the ideal thing to do would be to check whether double-word CAS is supported and, if so, use 16-byte pointers; if not, fall back on 8-byte stuffed pointers.
So, is there any way to actually check (via some #ifdefing) whether CMPXCHG16B is supported?
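One possible compile-time check, as a sketch: GCC (and GCC-compatible compilers) predefine __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 when the target supports a 16-byte CAS, e.g. when building with -mcx16 on x86_64.

#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
    // 16-byte CAS available: use the pointer + counter struct.
#else
    // Fall back on 8-byte stuffed pointers.
#endif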

CRITICAL_SECTION for set and get single bool value

I'm now writing a complicated class and feel that I use too much CRITICAL_SECTION.
As far as I know, there are atomic operations for some types that are always executed without any hardware or software interruption.
I want to check if I understand everything correctly.
To set or get an atomic value we don't need a CRITICAL_SECTION, because there won't be interrupts while doing so.
bool is atomic.
So those are my statements; I want to ask whether they are correct, and if they are, what other types of variables may also be set or get without a CRITICAL_SECTION?
P.S. I'm talking about getting or setting one single value per method - not two, not five, but one.
You don't need locks around atomic data, but internally they might lock. Note, for example, that C++11's std::atomic has an is_lock_free function.
bool may not be atomic. See here and here
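A quick sketch to check this on your own platform (the result is implementation-specific):

#include <atomic>
#include <iostream>

int main() {
    std::atomic<bool> flag{false};
    // True only if this object is implemented without an internal lock.
    std::cout << "atomic<bool> is lock-free: "
              << flag.is_lock_free() << '\n';
}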
Note: This answer applies to Windows and says nothing of other platforms.
There are no InterlockedRead or InterlockedWrite functions; simple reads and writes with the correct integer size (and alignment) are atomic on Windows ("Simple reads and writes to properly-aligned 32-bit variables are atomic operations.").
(and there are no cache problems since a properly-aligned variable is always on a single cache line).
However, reading and then modifying such variables (or any other variable) is not atomic:
Read a bool? Fine. Test-and-set a bool? Better use InterlockedCompareExchange.
Overwrite an integer? Great! Add to it? Critical section.
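A sketch of that test-and-set, assuming the Win32 Interlocked API (g_flag and try_acquire are illustrative names; LONG is used because the Interlocked family operates on 32-bit values, not bool):

#include <windows.h>

volatile LONG g_flag = 0;

bool try_acquire() {
    // Atomically set the flag to 1 only if it is currently 0.
    // InterlockedCompareExchange returns the previous value,
    // so 0 means we won the race.
    return InterlockedCompareExchange(&g_flag, 1, 0) == 0;
}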
The following can be found in the Windows documentation:
Simple reads and writes to properly aligned 64-bit variables are
atomic on 64-bit Windows. Reads and writes to 64-bit values are not
guaranteed to be atomic on 32-bit Windows. Reads and writes to
variables of other sizes are not guaranteed to be atomic on any
platform.
The result should be correct, but in programming it is better not to trust "should". There still remains a small possibility of failure because of the CPU cache.
You cannot guarantee for all implementations/platforms/compilers that bool, or any other type, or most operations, are atomic. So, no, I don't believe your statements are correct. You can retool your logic or use other means of establishing atomicity, but you probably can't get away with just removing CRITICAL_SECTION usage if you rely on it.

compare-and-swap atomic operation vs Load-link/store-conditional operation

On an x86 processor, I am not sure of the difference between a compare-and-swap atomic operation and a load-link/store-conditional operation. Is the latter safer than the former? Or is the first better than the second?
There are three common styles of atomic primitive: Compare-Exchange, Load-Linked/Store-Conditional, and Compare-And-Swap.
A CompareExchange operation will atomically read a memory location and, if it matches a compare value, store a specified new value. If the value that was read does not match the compare value, no store takes place. In any case, the operation will report the original value read.
A Compare-And-Swap operation is similar to CompareExchange, except that it does not report what value was read--merely whether whatever value was read matched the compare value. Note that a CompareExchange may be used to implement Compare-And-Swap by having it report whether the value read from memory matched the specified compare value.
The LL/SC combination allows a store operation to be conditioned upon whether some outside influence might have affected the target since its value was loaded. In particular, it guarantees that if the store succeeds, the location has not been written at all by outside code. Even if outside code wrote a new value and then re-wrote the original value, that would be guaranteed to cause the conditional store to fail. Conceptually, this might make LL/SC seem more powerful than the other methods, since it wouldn't have the "ABA" problem.

Unfortunately, LL/SC semantics allow stores to fail spontaneously, and the probability of spontaneous failure may go up rapidly as the complexity of the code between the load and store increases. While using LL/SC to implement something like an atomic increment directly would be more efficient than using it to implement a compare-and-swap and then building the atomic increment on top of that compare-and-swap, in situations where one needs to do much work between the load and the store, one should generally use LL/SC to implement a compare-and-swap, and then use that compare-and-swap method in a load-modify-CompareAndSwap loop.
Of the three primitives, Compare-And-Swap is the least powerful, but it can be implemented in terms of either of the other two. CompareAndSwap can do a pretty good job of emulating CompareExchange, though there are some corner cases where such emulation might live-lock. Neither CompareExchange nor Compare-And-Swap can offer guarantees quite as strong as LL/SC, though the limited amount of code one can reliably place within an LL/SC loop limits the usefulness of its guarantees. A sketch of that load-modify-CompareAndSwap loop follows.
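As a minimal sketch in C++ (atomic_increment is an illustrative name): compare_exchange_weak is allowed to fail spuriously precisely so that it can map directly onto an LL/SC pair on machines like ARM, and the retry loop absorbs those failures.

#include <atomic>

int atomic_increment(std::atomic<int>& counter) {
    int observed = counter.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads observed with the
    // current value, so the loop simply retries with fresh data.
    while (!counter.compare_exchange_weak(observed, observed + 1)) {
    }
    return observed + 1;
}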
x86 does not provide LL/SC instructions. Check out Wikipedia for platforms that do. Also see this SO question.