C++11 introduced the std::atomic<> template library. The standard specifies the store() and load() operations to atomically set / get a variable shared by more than one thread.
My question is: are assignment and access operations also atomic?
Namely, is:
std::atomic<bool> stop(false);
...
void thread_1_run_until_stopped()
{
    if (!stop.load())
        /* do stuff */
}

void thread_2_set_stop()
{
    stop.store(true);
}
Equivalent to:
void thread_1_run_until_stopped()
{
    if (!stop)
        /* do stuff */
}

void thread_2_set_stop()
{
    stop = true;
}
Are assignment and access operations for non-reference types also atomic?
Yes, they are. atomic<T>::operator T and atomic<T>::operator= are equivalent to atomic<T>::load and atomic<T>::store, respectively. All the operators on the atomic class are implemented so that they use atomic operations, as you would expect.
I'm not sure what you mean about "non-reference" types? Not sure how reference types are relevant here.
You can use either, but the advantage of load()/store() is that they let you specify a memory order. That sometimes matters for performance: you can specify std::memory_order_relaxed, whereas atomic<T>::operator T and atomic<T>::operator= always use the safest and slowest std::memory_order_seq_cst. Sometimes it also matters for correctness and readability of your code: although the default std::memory_order_seq_cst is the safest and therefore most likely to be correct, it is not immediately clear to the reader what kind of operation (acquire/release/consume) you are doing, or whether you are doing such an operation at all (to answer: isn't relaxed order sufficient here?).
Related
What is the simplest way to signal a background thread to stop executing?
I have used something like:
volatile bool Global_Stop = false;

void do_stuff() {
    while (!Global_Stop) {
        //...
    }
}
Is there anything wrong with this? I know for complex cases you might need "atomic" or mutexes, but for just boolean signaling this should work, right?
std::atomic is not for "complex cases". It is for when you need to access something from multiple threads. There are some myths about volatile; I cannot recall them all, because all I remember is that volatile does not help when you need to access something from different threads. You need a std::atomic<bool>. Whether accessing a bool is atomic on your actual hardware does not really matter, because as far as C++ is concerned it is not.
Yes there's a problem: that's not guaranteed to work in C++. But it's very simple to fix, so long as you're on at least C++11: use std::atomic<bool> instead, like this:
#include <atomic>

std::atomic<bool> Global_Stop{false};

void do_stuff() {
    while (!Global_Stop) {
        //...
    }
}
One problem is that the compiler is allowed to reorder memory accesses, so long as it can prove that it won't change the effect of the program:
int foo() {
    int i = 1;
    int j = 2;
    ++i;
    ++j;
    return i + j;
}
Here the compiler is allowed to increment j before i because it clearly won't change the effect of the program. In fact it can optimise the whole thing away into return 5;. So what counts as "won't change the effect of the program"? The answer is long and complex, and I don't pretend to understand all of it, but one part of it is that the compiler only has to worry about other threads in certain contexts. If i and j were global variables instead of local variables, it could still reverse ++i and ++j, because it's allowed to assume there's only one thread accessing them unless you use certain threading primitives (such as a mutex).
Now when it comes to code like this:
while (!Global_Stop) {
    //...
}
If it can prove the code hidden in the comment doesn't touch Global_Stop, and there are no threading primitives such as a mutex, it can happily optimise it to:
if (!Global_Stop) {
    while (true) {
        //...
    }
}
If it can prove that Global_Stop is false at the start then it can even remove the if check!
Actually things are even worse than this, at least in theory. You see, if a thread is in the process of writing to a variable when another thread accesses it, then only part of that write might be observed, giving you a totally different value (e.g. you update i from 3 to 4 and the other thread reads 7). Admittedly that is unlikely with a bool. But the standard goes even further than this: this situation is undefined behaviour, so it could even crash your program or have some other weird, unexpected behaviour.
Yes, this will most likely work, but only "by accident". As #idclev463035818 already wrote correctly:
std::atomic is not for "complex cases". It is for when you need to access something from multiple threads.
So in this case you should use atomic<bool> instead of volatile. The fact that volatile has been part of that language long before the introduction of threads in C++11 should already be a strong indication that volatile was never designed or intended to be used for multi-threading. It is important to note that in C++ volatile is something fundamentally different from volatile in languages like Java or C# where volatile is in fact related to the memory model. In these languages a volatile variable is much like an atomic in C++.
In C++, volatile is used for what is often referred to as "unusual memory", where memory can be read or modified outside the current process (for example when using memory mapped I/O). volatile forces the compiler to execute all operations in the exact order as specified. This prevents some optimizations that would be perfectly legal for atomics, while also allowing some optimizations that are actually illegal for atomics. For example:
volatile int x;
int y;
volatile int z;
x = 1;
y = 2;
z = 3;
z = 4;
...
int a = x;
int b = x;
int c = y;
int d = z;
In this example, there are two assignments to z and two read operations on x. If x and z were atomics instead of volatile, the compiler would be free to treat the first store as irrelevant and simply remove it. Likewise, it could just reuse the value returned by the first load of x, effectively generating code like int b = a. But since x and z are volatile, these optimizations are not possible. Instead, the compiler has to ensure that all volatile operations are executed in the exact order specified, i.e., the volatile operations cannot be reordered with respect to each other. However, this does not prevent the compiler from reordering non-volatile operations. For example, the operations on y could freely be moved up or down - something that would not be possible if x and z were atomics. So if you were to try to implement a lock based on a volatile variable, the compiler could simply (and legally) move some code outside your critical section.
Last but not least it should be noted that marking a variable as volatile does not prevent it from participating in a data race. In those rare cases where you have some "unusual memory" (and therefore really require volatile) that is also accessed by multiple threads, you have to use volatile atomics.
I have two threads that share a common variable.
The code structure is basically this (very simplified pseudo code):
static volatile bool commondata;

void Thread1()
{
    ...
    commondata = true;
    ...
}

void Thread2()
{
    ...
    while (!commondata)
    {
        ...
    }
    ...
}
Both threads run, and at some point Thread1 sets commondata to true. The while loop in Thread2 should then stop. The important thing here is that Thread2 "sees" the change made to commondata by Thread1.
I know that the naive method using a volatile variable is not correct and is not guaranteed to work on every platform.
Is it good enough to replace volatile bool commondata with std::atomic<bool> commondata?
Simple answer: yes! :)
All operations on atomics are data race free and by default sequentially consistent.
There is a nice caveat here. While the answer is generally 'yes', std::atomic does not, by itself, make the variable volatile. This means that if the compiler can (by some unfathomable means) infer that the variable did not change, it is allowed not to re-read it, since it may assume reading has no side effects.
If you check, there are overloads for both volatile and non-volatile versions of the class: http://eel.is/c++draft/atomics.types.generic
That could become important if atomic variable lives in shared memory, for example.
I have a background thread which loops on a state variable done. When I want to stop the thread I set the variable done to true. But apparently this variable is never set. I understand that the compiler might optimize it away so I have marked done volatile. But that seems not to have any effect. Note, I am not worried about race conditions so I have not made it atomic or used any synchronization constructs. How do I get the thread to not skip testing the variable at every iteration? Or is the problem something else entirely? done is initially false.
struct SomeObject
{
    volatile bool done_;
    void DoRun();
};

static void RunLoop(void* arg)
{
    if (!arg)
        return;
    SomeObject* thiz = static_cast<SomeObject*>(arg);
    while (!(thiz->done_)) {
        thiz->DoRun();
    }
    return;
}
volatile doesn't have any multi-threaded meaning in C++. It is a holdover from C, used as a modifier for sig_atomic_t flags touched by signal handlers and for access to memory mapped devices. There is no language-mandated compulsion for a C++ function to re-access memory, which leads to the race condition (reader never bothering to check twice as an "optimization") that others note above.
Use std::atomic (from C++11 or newer).
It can be, and usually is, lock-free:
struct SomeObject {
    std::atomic_bool done_;
    void DoRun();
    bool IsDone() { return done_.load(); }
    void KillMe() { done_.store(true); }
};

static void RunLoop(void *arg) {
    SomeObject &t = *static_cast<SomeObject *>(arg);
    std::cout << t.done_.is_lock_free(); // may be false on some archaic platforms
    while (!t.IsDone()) {
        t.DoRun();
    }
}
The load() and store() methods force the compiler to, at the least, check the memory location at every iteration. For x86[_64], the cache line for the SomeObject instance's done_ member will be cached and checked locally with no lock or even atomic/locked memory reads as-is. If you were doing something more complicated than a one-way flag set, you'd need to consider using something like explicit memory fences, etc.
Pre-C++11 has no multi-threaded memory model, so you will have to rely on a third-party library with special compiler privileges like pthreads or use compiler-specific functionality to get the equivalent.
This works as you expect in MSVC 2010. If I remove the volatile it loops forever; if I leave the volatile in, it works. This is because Microsoft treats volatile the way you expect, which is different from ISO.
This works too:
struct CDone {
    bool m_fDone;
};

int ThreadProc(volatile CDone *pDone) {
}
Here is what MSDN says:
http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
Objects that are declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object when it is requested, even if a previous instruction asked for a value from the same object.
Also, the value of the object is written immediately on assignment.
ISO Compliant:
If you are familiar with the C# volatile keyword, or familiar with the behavior of volatile in earlier versions of Visual C++, be aware that the C++11 ISO Standard volatile keyword is different and is supported in Visual Studio when the /volatile:iso compiler option is specified. (For ARM, it's specified by default). The volatile keyword in C++11 ISO Standard code is to be used only for hardware access; do not use it for inter-thread communication. For inter-thread communication, use mechanisms such as std::atomic from the C++ Standard Template Library.
Is the following singleton implementation data-race free?
static std::atomic<Tp *> m_instance;
...

static Tp &
instance()
{
    if (!m_instance.load(std::memory_order_relaxed))
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_instance.load(std::memory_order_acquire))
        {
            Tp * i = new Tp;
            m_instance.store(i, std::memory_order_release);
        }
    }
    return *m_instance.load(std::memory_order_relaxed);
}
Is the std::memory_order_acquire of the load operation superfluous? Is it possible to further relax both load and store operations by switching them to std::memory_order_relaxed? In that case, is the acquire/release semantic of std::mutex enough to guarantee correctness, or is a further std::atomic_thread_fence(std::memory_order_release) also required to ensure that the writes to memory in the constructor happen before the relaxed store? And is the use of a fence equivalent to having the store use memory_order_release?
EDIT: Thanks to John's answer, I came up with the following implementation, which should be data-race free. Even though the inner load could be non-atomic altogether, I decided to leave a relaxed load, since it does not affect performance. Compared to always performing the outer load with acquire memory order, the thread_local machinery improves the performance of accessing the instance by about an order of magnitude.
static Tp &
instance()
{
    static thread_local Tp *instance;
    if (!instance &&
        !(instance = m_instance.load(std::memory_order_acquire)))
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!(instance = m_instance.load(std::memory_order_relaxed)))
        {
            instance = new Tp;
            m_instance.store(instance, std::memory_order_release);
        }
    }
    return *instance;
}
I think this is a great question, and John Calsbeek has the correct answer.
However, just to be clear, a lazy singleton is best implemented using the classic Meyers singleton. It has guaranteed correct semantics in C++11.
§ 6.7.4
... If control enters
the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for
completion of the initialization. ...
The Meyers singleton is preferred because the compiler can aggressively optimize the concurrent code. The compiler would be more restricted if it had to preserve the semantics of a std::mutex. Furthermore, the Meyers singleton is two lines and virtually impossible to get wrong.
Here is a classic example of a Meyers singleton. Simple, elegant, and broken in C++03. But simple, elegant, and powerful in C++11.
class Foo
{
public:
    static Foo& instance( void )
    {
        static Foo s_instance;
        return s_instance;
    }
};
That implementation is not race-free. The atomic store of the singleton, while it uses release semantics, will only synchronize with the matching acquire operation—that is, the load operation that is already guarded by the mutex.
It's possible that the outer relaxed load would read a non-null pointer before the locking thread finished initializing the singleton.
The acquire that is guarded by the lock, on the other hand, is redundant. It will synchronize with any store with release semantics on another thread, but at that point (thanks to the mutex) the only thread that can possibly store is the current thread. That load doesn't even need to be atomic—no stores can happen from another thread.
See Anthony Williams' series on C++0x multithreading.
See also call_once.
Where you'd previously use a singleton to do something, but not actually use the returned object for anything, call_once may be the better solution.
For a regular singleton you could do call_once to set a (global?) variable and then return that variable...
Simplified for brevity:
template< class Function, class... Args>
void call_once( std::once_flag& flag, Function&& f, Args&& args...);
Exactly one execution of exactly one of the functions, passed as f to the invocations in the group (same flag object), is performed.
No invocation in the group returns before the abovementioned execution of the selected function is completed successfully
I am writing a program which has one process reading and writing to a shared memory and another process only reading it. In the shared memory there is a struct like this:
struct A{
    int a;
    int b;
    double c;
};
What I expect is to read the struct in one shot, because while I am reading, the other process might be modifying the contents of the struct. This can be achieved if struct assignment is atomic, that is, not interruptible. Like this:
struct A r = shared_struct;
So, is struct assignment atomic in C/C++? I tried searching the web but cannot find helpful answers. Can anyone help?
Thank you.
No, neither the C nor the C++ standard guarantees assignment operations to be atomic. You need some implementation-specific support for that - either something in the compiler or in the operating system.
C and C++ support atomic types in their current standards.
C++11 introduced support for atomic types. Likewise C11 introduced atomics.
Do you need to atomically snapshot all the struct members? Or do you just need shared read/write access to separate members separately? The latter is a lot easier, see below.
C11 stdatomic and C++11 std::atomic provide syntax for atomic objects of arbitrary sizes. But if they're larger than 8 or 16 bytes, they won't be lock-free on typical systems (i.e., atomic load, store, exchange, or CAS will be implemented by taking a hidden lock and then copying the whole struct).
If you just want a couple members, it's probably better to use a lock yourself and then access the members, instead of getting the compiler to copy the whole struct. (Current compilers aren't good at optimizing weird uses of atomics like that).
Or add a level of indirection, so there's a pointer which can easily be updated atomically to point to another struct with a different set of values. This is the building block for RCU (Read-Copy-Update) See also https://lwn.net/Articles/262464/. There are good library implementations of RCU, so use one instead of rolling your own unless your use-case is a lot simpler than the general case. Figuring out when to free old copies of the struct is one of the hard parts, because you can't do that until the last reader thread is done with it. And the whole point of RCU is to make the read path as light-weight as possible...
Your struct is 16 bytes on most systems; just barely small enough that x86-64 can load or store the entire thing somewhat more efficiently than just taking a lock (but only with lock cmpxchg16b). Still, it's not totally silly to use C/C++ atomics for this.
common to both C++11 and C11:
struct A{
    int a;
    int b;
    double c;
};
In C11 use the _Atomic type qualifier to make an atomic type. It's a qualifier like const or volatile, so you can use it on just about anything.
#include <stdatomic.h>

_Atomic struct A shared_struct;

// atomically take a snapshot of the shared state and do something
double read_shared(void) {
    struct A tmp = shared_struct; // defaults to memory_order_seq_cst
    // or tmp = atomic_load_explicit(&shared_struct, memory_order_relaxed);
    //int t = shared_struct.a;    // UNDEFINED BEHAVIOUR

    // then do whatever you want with the copy, it's a normal struct
    if (tmp.a > tmp.b)
        tmp.c = -tmp.c;
    return tmp.c;
}

// or take tmp by value or pointer as a function arg
// static inline
void update_shared(int a, int b, double c) {
    struct A tmp = {a, b, c};
    //shared_struct = tmp;
    // If you just need atomicity, not ordering, relaxed is much faster
    // for small lock-free objects (no memory barrier)
    atomic_store_explicit(&shared_struct, tmp, memory_order_relaxed);
}
Note that accessing a single member of an _Atomic struct is undefined behaviour. It won't respect locking, and might not be atomic. So don't do int i = shared_struct.a; (C++11 won't compile that, but C11 will).
In C++11, it's nearly the same: use the std::atomic<T> template.
#include <atomic>

std::atomic<A> shared_struct;

// atomically take a snapshot of the shared state and do something
double read_shared(void) {
    A tmp = shared_struct; // defaults to memory_order_seq_cst
    // or A tmp = shared_struct.load(std::memory_order_relaxed);
    // or tmp = std::atomic_load_explicit(&shared_struct, std::memory_order_relaxed);
    //int t = shared_struct.a; // won't compile: no operator.() overload

    // then do whatever you want with the copy, it's a normal struct
    if (tmp.a > tmp.b)
        tmp.c = -tmp.c;
    return tmp.c;
}

void update_shared(int a, int b, double c) {
    A tmp{a, b, c};
    //shared_struct = tmp;
    // If you just need atomicity, not ordering, relaxed is much faster
    // for small lock-free objects (no memory barrier)
    shared_struct.store(tmp, std::memory_order_relaxed);
}
See it on the Godbolt compiler explorer
Or if you don't need to snapshot the entire struct, but instead just want each member to be separately atomic, then you can simply make each member an atomic type (like atomic_int, and _Atomic double or std::atomic<double>).
struct Amembers {
    atomic_int a, b;
#ifdef __cplusplus
    std::atomic<double> c;
#else
    _Atomic double c;
#endif
} shared_state;
// If these members are used separately, put them in separate cache lines
// instead of in the same struct to avoid false sharing cache-line ping pong.
(Note that C11 stdatomic is not guaranteed to be compatible with C++11
std::atomic, so don't expect to be able to access the same struct from C or C++.)
In C++11, struct-assignment for a struct with atomic members won't compile, because std::atomic deletes its copy-constructor. (You're supposed to load std::atomic<T> shared into T tmp, like in the whole-struct example above.)
In C11, struct-assignment for a non-atomic struct with atomic members will compile but is not atomic. The C11 standard doesn't specifically point this out anywhere. The best I can find is: n1570: 6.5.16.1 Simple assignment:
In simple assignment (=), the value of the right operand is converted to the type of the
assignment expression and replaces the value stored in the object designated by the left
operand.
Since this doesn't say anything about special handling of atomic members, it must be assumed that it's like a memcpy of the object representations. (Except that it's allowed to not update the padding.)
In practice, it's easy to get gcc to generate asm for structs with an atomic member where it copies non-atomically. Especially with a large atomic member which is atomic but not lock-free.
No it isn't.
Whether it is is actually a property of the CPU architecture in relation to the memory layout of the struct.
You could use the 'atomic pointer swap' solution, which can be made atomic and can be used in a lock-free scenario.
Be sure to mark the respective shared pointer variables as volatile if it is important that changes are seen by other threads 'immediately'. This, in real life (TM), is not enough to guarantee correct treatment by the compiler; instead, program against atomic primitives/intrinsics directly when you want lock-free semantics (see the comments and linked articles for background).
Of course, inversely, you'll have to make sure you take a deep copy at the relevant times in order to do processing on the reading side of this.
Now all of this quickly becomes highly complex in relation to memory management, and I suggest you scrutinize your design and ask yourself seriously whether all the (perceived?) performance benefits justify the risk. Why not opt for a simple reader/writer lock, or get your hands on a fancy shared-pointer implementation that is thread-safe?