In C++20, we got the capability to sleep on atomic variables, waiting for their value to change.
We do so by using the std::atomic::wait method.
Unfortunately, while wait has been standardized, wait_for and wait_until are not, meaning that we cannot sleep on an atomic variable with a timeout.
Sleeping on an atomic variable is in any case implemented behind the scenes with WaitOnAddress on Windows and the futex system call on Linux.
To work around the above problem (no way to sleep on an atomic variable with a timeout), I could pass the memory address of an std::atomic to WaitOnAddress on Windows, and it would (kinda) work with no UB, as the function takes a void* parameter and it's valid to cast an std::atomic<type>* to void*.
On Linux, it is unclear whether it's OK to mix std::atomic with futex. futex takes either a uint32_t* or an int32_t* (depending on which manual you read), and casting std::atomic<u/int> to u/int* is UB. On the other hand, the manual says:
The uaddr argument points to the futex word. On all platforms,
futexes are four-byte integers that must be aligned on a four-
byte boundary. The operation to perform on the futex is
specified in the futex_op argument; val is a value whose meaning
and purpose depends on futex_op.
This hints that an alignas(4) std::atomic<int> should work, and that it doesn't matter which integer type it is, as long as the type has a size of 4 bytes and an alignment of 4.
Also, I have seen this trick of combining atomics and futexes implemented in many places, including Boost and TBB.
So what is the best way to sleep on an atomic variable with a timeout in a non-UB way?
Do we have to implement our own atomic class with OS primitives to achieve it correctly?
(Solutions like mixing atomics and condition variables exist, but they are sub-optimal.)
You shouldn't necessarily have to implement a full custom atomic API; it should actually be safe to simply pull out a pointer to the underlying data from the atomic<T> and pass it to the system.
Since std::atomic does not offer an equivalent of native_handle the way other synchronization primitives do, you're going to be stuck doing some implementation-specific hacks to try to get it to interface with the native API.
For the most part, it's reasonably safe to assume that the first member of these types will be the same as the T type -- at least for integral values [1]. This is an assurance that makes it possible to extract out this value.
... and casting std::atomic<u/int> to u/int* is UB
This isn't actually the case.
std::atomic is guaranteed by the standard to be a standard-layout type. One helpful but often esoteric property of standard-layout types is that it is safe to reinterpret_cast a pointer (or reference) to the object into a pointer (or reference) to its first sub-object (e.g. the first member of the std::atomic).
As long as we can guarantee that the std::atomic<u/int> contains only the u/int as a member (or at least, as its first member), then it's completely safe to extract out the type in this manner:
auto* r = reinterpret_cast<std::uint32_t*>(&atomic);
// Pass to futex API...
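To make that concrete, here is a hedged sketch of a timed wait built on this cast (Linux-specific; FUTEX_WAIT takes a relative timeout, and a real implementation would also handle EINTR and recompute the remaining time):

#include <atomic>
#include <cstdint>
#include <linux/futex.h>  // FUTEX_WAIT_PRIVATE
#include <sys/syscall.h>  // SYS_futex
#include <time.h>         // struct timespec
#include <unistd.h>       // syscall

// Sleeps while the atomic still holds `expected`, for at most `rel_timeout`.
// Returns true if woken; false on timeout or if the value had already changed.
bool atomic_wait_for(std::atomic<std::uint32_t>& a,
                     std::uint32_t expected,
                     timespec rel_timeout)
{
    auto* addr = reinterpret_cast<std::uint32_t*>(&a);  // the standard-layout assumption from above
    long rc = syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected,
                      &rel_timeout, nullptr, 0);
    return rc == 0;
}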
This approach should also hold on Windows: cast the atomic to the underlying type before passing it to the void* API.
Note: Passing a T* pointer to a void* that gets reinterpreted as a U* (such as an atomic<T>* to void* when it expects a T*) is undefined behavior -- even with standard-layout guarantees (as far as I'm aware). It will still likely work because the compiler can't see into the system APIs -- but that doesn't make the code well-formed.
Note 2: I can't speak on the WaitOnAddress API as I haven't actually used this -- but any atomics API that depends on the address of a properly aligned integral value (void* or otherwise) should work properly by extracting out a pointer to the underlying value.
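For reference, going by the documented WaitOnAddress signature (declared via windows.h, link with Synchronization.lib), such a call might look roughly like this untested sketch:

#include <atomic>
#include <cstdint>
#include <windows.h>

// Sleeps while the atomic still holds `undesired`, for at most `timeout_ms` milliseconds.
// Returns non-zero if woken by WakeByAddressSingle/All; zero (FALSE) on timeout.
bool wait_for_change(std::atomic<std::uint32_t>& a,
                     std::uint32_t undesired,
                     DWORD timeout_ms)
{
    // Same first-member/layout assumption as the futex sketch above.
    return WaitOnAddress(&a, &undesired, sizeof(undesired), timeout_ms) != 0;
}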
[1] Since this is tagged C++20, you can verify this with std::is_layout_compatible with a static_assert:
static_assert(std::is_layout_compatible_v<int,std::atomic<int>>);
(Thanks to @apmccartney for this suggestion in the comments).
I can confirm that this will be layout compatible for Microsoft's STL, libc++, and libstdc++; however if you don't have access to is_layout_compatible and you're using a different system, you might want to check your compiler's headers to ensure this assumption holds.
You could use a "non-atomic" alignas(4) uint32_t variable with the futex calls, and perform other atomic operations on it via std::atomic_ref. See non-atomic operations on atomic variables and vice versa.
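A minimal sketch of that approach (Linux-specific, illustrative names; assumes the relative-timeout form of FUTEX_WAIT):

#include <atomic>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// A plain uint32_t, aligned the way std::atomic_ref requires (4 on mainstream platforms).
alignas(std::atomic_ref<std::uint32_t>::required_alignment)
std::uint32_t futex_word = 0;

void set_and_wake() {
    std::atomic_ref<std::uint32_t>{futex_word}.store(1, std::memory_order_release);
    syscall(SYS_futex, &futex_word, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
}

// Sleeps only if the word is still 0; returns true if woken before the relative timeout.
bool timed_wait_while_zero(timespec rel_timeout) {
    return syscall(SYS_futex, &futex_word, FUTEX_WAIT_PRIVATE, 0u,
                   &rel_timeout, nullptr, 0) == 0;
}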
Related
I wrote some multithreaded but lock-free code that compiled and apparently executed fine on an earlier C++11-supporting GCC (7 or older). The atomic fields were ints and so on. To the best of my recollection, I used normal C/C++ operations to operate on them (a=1;, etc.) in places where atomicity or event ordering wasn't a concern.
Later I had to do some double-width CAS operations, and made a little struct with a pointer and a counter, as is common. I tried doing the same normal C/C++ operations, and got errors saying the variable had no such members. (Which is what you'd expect from most normal templates, but I half-expected atomic to work differently, in part because, to the best of my recollection, normal assignments to and from were supported for ints.)
So two part question:
Should we use the atomic methods in all cases, even (say) initialization done by one thread with no race conditions? 1a) So once declared atomic, there's no way to access it unatomically? 1b) Do we also have to use the verboser verbosity of the atomic<> methods to do so?
Or, for integer types at least, can we use normal C/C++ operations? In that case, will those operations be the same as load()/store(), or are they merely normal assignments?
And a semi-meta question: is there any insight as to why normal C/C++ operations aren't supported on atomic<> variables? I'm not sure if the C++11 language as spec'd has the power to write code that does that, but the spec can certainly require the compiler to do things the language as spec'd isn't powerful enough to do.
You're maybe looking for C++20 std::atomic_ref<T> to give you the ability to do atomic ops on objects that can also be accessed non-atomically. Make sure your non-atomic T object is declared with sufficient alignment for atomic<T>. e.g.
alignas(std::atomic_ref<long long>::required_alignment)
long long sometimes_shared_var;
But that requires C++20, and nothing equivalent is available in C++17 or earlier. Once an atomic object is constructed, I don't think there's any guaranteed portable safe way to modify it other than its atomic member functions.
Its internal object representation isn't guaranteed by the standard, so using memcpy to get the struct sixteenbyte object out of an atomic<sixteenbyte> efficiently isn't guaranteed to be safe, even if no other thread has a reference to it. You'd have to know how a specific implementation stores it. Checking sizeof(atomic<T>) == sizeof(T) is a good sign, though, and mainstream implementations do in practice just have a T as the object representation for atomic<T>.
Related: How can I implement ABA counter with c++11 CAS? for a nasty union hack ("safe" in GNU C++) to give efficient access to a single member, because compilers don't optimize foo.load().ptr to just atomically load that member. Instead, GCC and clang will use lock cmpxchg16b to load the whole pointer+counter pair, then take just the first member. C++20 atomic_ref<> should solve that.
Accessing members of atomic<struct foo>: one reason for not allowing shared.x = tmp; is that it's the wrong mental model. If two different threads are storing to different members of the same struct, how does the language define any ordering for what other threads see? Plus, it was probably considered too easy for programmers to design their lockless algorithms incorrectly if stuff like that were allowed.
Also, how would you even implement that? Return an lvalue-reference? It can't be to the underlying non-atomic object. And what if the code captures that reference and keeps using it long after calling some function that's not load or store?
Remember that ISO C++'s ordering model works in terms of synchronizes-with, not in terms of local reordering and a single cache-coherent domain like the way real ISAs define their memory models. The ISO C++ model is always strictly in terms of reading, writing, or RMWing the entire atomic object. So a load of the object can always sync-with any store of the whole object.
In hardware that would actually still work for a store to one member and a load from a different member if the whole object is in one cache line, on real-world ISAs. At least I think so, although possibly not on some SMT systems. (Being in one cache line is necessary for lock-free atomic access to the whole object to be possible on most ISAs.)
we also have to use the verboser verbosity of the atomic<> methods to do so?
The member functions of atomic<T> include overloads of all the operators, including operator= (store) and cast back to T (load). a = 1; is equivalent to a.store(1, std::memory_order_seq_cst) for atomic<int> a; and is the slowest way to set a new value.
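For concreteness, a small sketch of that equivalence and the cheaper, explicitly ordered alternatives:

#include <atomic>

std::atomic<int> a;

void demo() {
    a = 1;                                     // same as the next line: a seq_cst store
    a.store(1, std::memory_order_seq_cst);
    a.store(1, std::memory_order_relaxed);     // weaker ordering; on x86 a plain mov instead of xchg / mov+mfence
    int v = a;                                 // implicit conversion: a.load(std::memory_order_seq_cst)
    int w = a.load(std::memory_order_acquire); // explicit, weaker ordering
    (void)v; (void)w;
}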
Should we use the atomic methods in all cases, even (say) initialization done by one thread with no race conditions?
You don't have any choice, other than passing args to the constructors of std::atomic<T> objects.
You can use mo_relaxed loads / stores while your object is still thread-private, though. Avoid any RMW operators like +=. e.g. a.store(a.load(relaxed) + 1, relaxed); will compile about the same as for non-atomic objects of register-width or smaller.
(Except that it can't optimize away and keep the value in a register, so use local temporaries instead of actually updating the atomic object).
But for atomic objects too large to be lock-free, there's not really anything you can do efficiently except construct them with the right values in the first place.
The atomic fields were ints and so on. ...
and apparently executed fine
If you mean plain int, not atomic<int>, then it wasn't portably safe.
Data-race UB doesn't guarantee visible breakage, the nasty thing with undefined behaviour is that happening to work in your test case is one of the things that's allowed to happen.
And in many cases with pure load or pure store, it won't break, especially on strongly ordered x86, unless the load or store can hoist or sink out of a loop. Why is integer assignment on a naturally aligned variable atomic on x86?. It'll eventually bite you when a compiler manages to do cross-file inlining and reorder some operations at compile time, though.
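As a concrete illustration (hypothetical names), this is the classic pattern that happens to work until the optimizer hoists the load out of the loop:

int ready = 0;              // plain non-atomic int: any concurrent write is data-race UB

void wait_for_ready() {
    while (ready == 0) {    // the optimizer may load `ready` once and spin forever
    }
    // ... proceed once another thread has (unsafely) set ready = 1
}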
why normal C/C++ operations aren't supported on atomic<> variables?
... but the spec can certainly require the compiler to do things the language as spec'd isn't powerful enough to do.
This in fact was a limitation of C++11 through 17. Most compilers have no problem with it. For example, gcc/clang's implementation of the <atomic> header uses __atomic_ builtins which take a plain T* pointer.
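For illustration, a minimal sketch of those builtins operating directly on a plain object (GNU extensions, non-portable):

// Atomic operations on an ordinary int via the GCC/Clang builtins that <atomic> is built on.
int plain = 0;

void publish(int v) {
    __atomic_store_n(&plain, v, __ATOMIC_RELEASE);     // atomic store on a non-atomic object
}

int consume_value() {
    return __atomic_load_n(&plain, __ATOMIC_ACQUIRE);  // matching acquire load
}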
The C++20 proposal for atomic_ref is p0019, which cites as motivation:
An object could be heavily used non-atomically in well-defined phases
of an application. Forcing such objects to be exclusively atomic would
incur an unnecessary performance penalty.
3.2. Atomic Operations on Members of a Very Large Array
High-performance computing (HPC) applications use very large arrays. Computations with these arrays typically have distinct phases that allocate and initialize members of the array, update members of the array, and read members of the array. Parallel algorithms for initialization (e.g., zero fill) have non-conflicting access when assigning member values. Parallel algorithms for updates have conflicting access to members which must be guarded by atomic operations. Parallel algorithms with read-only access require best-performing streaming read access, random read access, vectorization, or other guaranteed non-conflicting HPC pattern.
All of these things are a problem with std::atomic<>, confirming your suspicion that this is a problem for C++11.
Instead of introducing a way to do non-atomic access to std::atomic<T>, they introduced a way to do atomic access to a T object. One problem with this is that atomic<T> might need more alignment than a T would get by default, so be careful.
Unlike with giving atomic access to members of T, you could plausibly have a .non_atomic() member function that returned an lvalue reference to the underlying object.
I am wondering how std::atomic_ref can be implemented efficiently (one std::mutex per object?) for non-atomic objects, as the following property seems rather hard to enforce:
Atomic operations applied to an object through an atomic_ref are atomic with respect to atomic operations applied through any other atomic_ref referencing the same object.
In particular, the following code:
void set(std::vector<Big> &objs, size_t i, const Big &val) {
    std::atomic_ref RefI{objs[i]};
    RefI.store(val);
}
This seems quite difficult to implement, as the std::atomic_ref would need to somehow pick the same std::mutex every time (unless it is a big master lock shared by all objects of the same type).
Am I missing something? Or each object is responsible to implement std::atomic_ref and therefore either be atomic or carry a std::mutex?
The implementation is pretty much exactly the same as std::atomic<T> itself. This is not a new problem.
See Where is the lock for a std::atomic? A typical implementation of std::atomic / std::atomic_ref uses a static hash table of locks, indexed by address, for non-lock-free objects. Hash collisions only lead to extra contention, not a correctness problem. (Deadlocks are still impossible; the locks are only used by atomic functions, which never try to take 2 at a time.)
On GCC for example, std::atomic_ref is just another way to invoke __atomic_store on an object. (See the GCC manual: atomic builtins)
The compiler knows whether T is small enough to be lock-free or not. If not, it calls the libatomic library function which will use the lock.
Fun fact: that means it only works if the object has sufficient alignment for atomic<T>. But on many 32-bit platforms including x86, uint64_t might only have 4-byte alignment. atomic_ref on such an object will compile and run, but not actually be atomic if the compiler uses an SSE 8-byte load/store in 32-bit mode to implement it. Fortunately there's no danger for objects that have alignof(T) == sizeof(T), like most primitive types on 64-bit architectures.
This is why you need to allocate the underlying non-atomic object with the required alignment, e.g.
alignas(std::atomic_ref<T>::required_alignment) T foo;
or check that it must be sufficiently aligned already, e.g.
static_assert( std::atomic_ref<T>::required_alignment == alignof(T), "T isn't *guaranteed* aligned enough for atomic_ref" );
See https://en.cppreference.com/w/cpp/atomic/atomic_ref/required_alignment
An implementation can use a hash based on the address of the object to determine which of a set of locks to acquire while performing the operation.
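For illustration, a minimal sketch of that idea, with hypothetical names and pool size (real libraries such as GCC's libatomic differ in the details):

#include <cstdint>
#include <cstring>
#include <mutex>

inline std::mutex lock_pool[64];            // one fixed, static pool shared by all objects

inline std::mutex& lock_for(const void* addr) {
    auto h = reinterpret_cast<std::uintptr_t>(addr);
    h ^= h >> 9;                            // cheap mixing so nearby objects spread across the pool
    return lock_pool[h % 64];
}

template <class T>                          // assumes T is trivially copyable
void locked_store(T* obj, const T& val) {   // roughly what a non-lock-free fallback does
    std::scoped_lock guard(lock_for(obj));
    std::memcpy(obj, &val, sizeof(T));      // plain copy, made "atomic" by the lock
}

Every operation on the same address hashes to the same mutex, which is exactly the property the quoted wording requires; two different objects colliding on one lock only costs contention.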
I was wondering what the differences are between accessing a boolean value using Windows' interlockedXXX functions and using std::atomic_flag.
To my knowledge, both of them are lock-less and you can't set or read an atomic_flag directly. I wonder whether there are more differences.
std::atomic_flag serves basically as a primitive for building other synchronization primitives. If one needs to set or read a value, it might make more sense to compare with std::atomic<bool>.
However, there are some additional (conceptual) differences:
With interlockedXXX, you won't get portable code.
interlockedXXX is a function, while std::atomic_flag (as well as std::atomic) is a type. That's a significant difference, since you can use interlockedXXX with any suitable memory location, such as an element of std::vector. On the contrary, you cannot make a vector of C++ atomic flags or atomic bools, since the corresponding types do not meet the vector value type requirements. [1]
You can see the latter difference in the code created by @RmMm, where flag is an ordinary variable. I also added a case with atomic<bool>, and you may notice that all three variants produce the very same assembly:
https://godbolt.org/z/9xwRV6
[1] This problem should be addressed by std::atomic_ref in C++20.
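For illustration, a minimal C++20 sketch of that fix (hypothetical function name): an atomic increment on an element of a plain std::vector<int> via std::atomic_ref, which sidesteps the "no vector of atomics" restriction:

#include <atomic>
#include <cstddef>
#include <vector>

void bump(std::vector<int>& counters, std::size_t i) {
    std::atomic_ref<int> ref{counters[i]};
    ref.fetch_add(1, std::memory_order_relaxed);   // atomic RMW on a non-atomic element
}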
In C and C++ I usually access memory mapped hardware registers with the well known pattern:
typedef unsigned int uint32_t;
*((volatile uint32_t*)0xABCDEDCB) = value;
As far as I know, the only thing guaranteed by the C or C++ standard is that accesses to volatile variables are evaluated strictly according to the rules of the abstract machine.
How can I be sure that the compiler will not generate torn stores for the access on a 32-bit processor? For example, the compiler is allowed to emit two 16-bit stores instead of one 32-bit store, isn't it?
Are there any guarantees in this area made by gcc?
When speaking about MCUs, as far as I know there are no such guarantees. Even more, each case of accessing HW registers may be device-specific and often may have its own sequence, rules and/or set of assembler instructions. And it depends on the compiler implementation, too.
The only thing that works for me here is reading datasheets concerning concrete devices/compilers and following the examples.
If you are really worried, use inline assembler. A single assembler instruction will not return until completed.
Also, you must ensure that the memory page you are writing to is not cached, otherwise the write may not go all the way through. On ARM, memory barriers may be necessary as well.
volatile is just an instruction that tells the compiler to make no assumptions about the content of the memory, since the value may be changed outside your program, but it has no effect on read/write ordering. Use memory barriers or atomics if this is an issue.
Microsoft comment about ISO compliant usage of volatile
"The volatile keyword in C++11 ISO Standard code is to be used only for hardware access"
http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
At least in the case of Microsoft C++ (going back to Visual Studio 2005), an example of a pointer to volatile type is shown:
http://msdn.microsoft.com/en-us/library/145yc477.aspx
Another reference, in this case C, which also includes examples of pointers to volatile types.
"static volatile objects model memory-mapped I/O ports, and static const volatile objects model memory-mapped input ports"
http://en.cppreference.com/w/c/language/volatile
Operations on volatile types are not allowed to be reordered by the compiler or hardware, a requirement for hardware memory-mapped access. However, operations on a combination of volatile and non-volatile types may end up with reordered operations on the non-volatile types, making them not thread-safe (all inter-thread sharing of variables would require all of them to be volatile to be thread-safe). Even if two threads only share volatile types, there's still a data-race issue (one thread reads just before the other thread writes).
Microsoft compilers have a non-portable (to other compilers) extension to volatile that makes it thread-safe (/volatile:ms, Microsoft-specific, used by default except for ARM processors).
Back to the original question: in the case of GCC, you can have the compiler generate assembly code to verify that the operation is safe.
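For example (hypothetical file name), compile with -S and check that the volatile access is emitted as a single full-width store:

// mmio.cpp -- inspect the output of `g++ -O2 -S mmio.cpp` (or objdump -d on the object file)
// and confirm the store below is a single 32-bit store instruction, not two 16-bit ones.
#include <cstdint>

void write_reg(std::uint32_t value) {
    *reinterpret_cast<volatile std::uint32_t*>(0xABCDEDCBu) = value;
}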
How can I be sure that the compiler will not generate torn stores for the access on a 32-bit processor? For example, the compiler is allowed to emit two 16-bit stores instead of one 32-bit store, isn't it?
Normally, the compiler can combine or split memory accesses under the as-if rule, as long as the observable behavior of the program is unchanged, since the observable behavior of access to ordinary objects is the effect on the object's value, and not the memory access itself.
However, accesses to volatile objects are part of the observable behavior of a program. Therefore the compiler can no longer combine or split memory transactions. In the section where the C++ Standard defines "observable behavior" it specifically says that "Access to volatile objects are evaluated strictly according to the rules of the abstract machine."
Please note that the code shown is still non-portable C++, because the C++ Standard only cares about whether the object accessed is volatile, and not about modifiers on the pointer used to form an lvalue for said access. You'd need to do something crazy like this example of placement-new, to force the existence of a volatile object:
*(new ((void*)0xABCDEDCB) volatile uint32_t) = value;
Of course, there's no such thing in std, but I need equivalent functionality.
I have a lock-free data structure templated on a type T, where T is provided by the user, and what I need to statically assert is that T is a type that is atomically assignable on x86 or x86-64 (which includes all built-in integral and floating-point types, and any typedef thereof, but I think is not necessarily limited to those). I'm guessing that merely checking that the type is trivially assignable and that its sizeof is <= 8 is not sufficient. What's the best way to do this? Forcing T to be an std::atomic<> and then checking is_lock_free() is out of the question.
"atomically assignable" is not sufficient condition for using a type to implement a lock-free data structure, so this idea is going down to the wrong path from the start.
Using std::atomic (and friends) is the only way in C++ to have both the atomicity and the ordering guarantees necessary to implement a lock-free data structure. Atomic assignment is useless if no other thread will ever see it.
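If the ban on wrapping T in std::atomic can be relaxed, note that C++17 does offer a compile-time check, std::atomic<T>::is_always_lock_free, which avoids the runtime is_lock_free() call the question rules out. A minimal sketch, with a hypothetical container name:

#include <atomic>
#include <type_traits>

template <class T>
struct lock_free_container {                     // hypothetical name
    static_assert(std::is_trivially_copyable_v<T>,
                  "T must be trivially copyable to be stored in std::atomic");
    static_assert(std::atomic<T>::is_always_lock_free,
                  "std::atomic<T> would fall back to a lock on this target");
    // ... internally the container stores std::atomic<T>
};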