volatile variable being optimized away? - c++

I have a background thread which loops on a state variable done. When I want to stop the thread I set the variable done to true. But apparently this variable is never set. I understand that the compiler might optimize it away so I have marked done volatile. But that seems not to have any effect. Note, I am not worried about race conditions so I have not made it atomic or used any synchronization constructs. How do I get the thread to not skip testing the variable at every iteration? Or is the problem something else entirely? done is initially false.
struct SomeObject
{
    volatile bool done_;
    void DoRun();
};

static void RunLoop(void* arg)
{
    if (!arg)
        return;
    SomeObject* thiz = static_cast<SomeObject*>(arg);
    while (!(thiz->done_)) {
        thiz->DoRun();
    }
    return;
}

volatile doesn't have any multi-threaded meaning in C++. It is a holdover from C, used as a modifier for sig_atomic_t flags touched by signal handlers and for access to memory mapped devices. There is no language-mandated compulsion for a C++ function to re-access memory, which leads to the race condition (reader never bothering to check twice as an "optimization") that others note above.
Use std::atomic (from C++11 or newer).
It can be, and usually is lock-free:
struct SomeObject {
    std::atomic_bool done_;
    void DoRun();
    bool IsDone() { return done_.load(); }
    void KillMe() { done_.store(true); }
};

static void RunLoop(void* arg) {
    SomeObject& t = *static_cast<SomeObject*>(arg);
    std::cout << t.done_.is_lock_free(); // may be false on some archaic platforms
    while (!t.IsDone()) {
        t.DoRun();
    }
}
The load() and store() methods force the compiler to, at the least, check the memory location at every iteration. For x86[_64], the cache line for the SomeObject instance's done_ member will be cached and checked locally with no lock or even atomic/locked memory reads as-is. If you were doing something more complicated than a one-way flag set, you'd need to consider using something like explicit memory fences, etc.
Pre-C++11 has no multi-threaded memory model, so you will have to rely on a third-party library with special compiler privileges like pthreads or use compiler-specific functionality to get the equivalent.
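For completeness, the fix can be sketched end to end with `std::thread` driving the loop. This is a minimal, testable version of the pattern above; the iteration counter is added here only so the sketch has an observable result, and the original question's `void*` plumbing is replaced by a lambda capture:

```cpp
#include <atomic>
#include <thread>

struct SomeObject {
    std::atomic<bool> done_{false};  // initially false, as in the question
    long iterations_ = 0;
    void DoRun() { ++iterations_; }  // stand-in for the real work
};

// Runs the loop on a background thread, then signals it to stop.
long run_and_stop() {
    SomeObject obj;
    std::thread worker([&obj] {
        while (!obj.done_.load()) {  // the compiler must re-read memory here
            obj.DoRun();
        }
    });
    obj.done_.store(true);           // the worker observes this and exits
    worker.join();                   // join() synchronizes, so reading
    return obj.iterations_;          // iterations_ afterwards is safe
}
```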

This works like you expect in MSVC 2010. If I remove the volatile it loops forever; if I leave the volatile in, it works. That is because Microsoft treats volatile the way you expect, which is different from the ISO standard.
This works too:
struct CDone {
    bool m_fDone;
};

int ThreadProc(volatile CDone* pDone) {
    // accesses through pDone are volatile, even though m_fDone itself is not
}
Here is what MSDN says:
http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
Objects that are declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object when it is requested, even if a previous instruction asked for a value from the same object.
Also, the value of the object is written immediately on assignment.
ISO Compliant:
If you are familiar with the C# volatile keyword, or familiar with the behavior of volatile in earlier versions of Visual C++, be aware that the C++11 ISO Standard volatile keyword is different and is supported in Visual Studio when the /volatile:iso compiler option is specified. (For ARM, it's specified by default.) The volatile keyword in C++11 ISO Standard code is to be used only for hardware access; do not use it for inter-thread communication. For inter-thread communication, use mechanisms such as std::atomic from the C++ Standard Library.

Related

Is there anything that would make a static bool thread safe?

I recently came across some code that was working fine where a static bool was shared between multiple threads (single writer, multiple receivers) although there was no synchronization.
Something like that (simplified):
// header A
struct A {
    static bool f;
    static bool isF() { return f; }
};

// source A
bool A::f = false;

void threadWriter() {
    /* Do something */
    A::f = true;
}

// source B
void threadReader() {
    while (!A::isF()) { /* Do something */ }
}
For me, this kind of code has a race condition in that even though operations on bool are atomic (on most CPUs), we have no guarantee that the write from the writer thread will be visible to the reader threads. But some people told me that the fact that f is static would help.
So, is there anything in C++11 that would make this code safe? Or anything related to static that would make this code work?
Your hardware may be able to atomically operate on a bool. However, that does not make this code safe. As far as C++ is concerned, you are writing and reading the bool in different threads without synchronisation, which is undefined.
Making the bool static does not change that.
To access the bool in a thread-safe way you can use a std::atomic<bool>. Whether the atomic uses a mutex or other locking is up to the implementation.
Though, also a std::atomic<bool> is not sufficient to synchronize the threadReader() and threadWriter() in case each /*Do something */ is accessing the same shared data.
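A minimal sketch of the question's example with the data race removed: `f` becomes a `std::atomic<bool>`, so the write in one thread is guaranteed to become visible to readers (this uses the default seq_cst loads and stores for simplicity):

```cpp
#include <atomic>

struct A {
    static std::atomic<bool> f;
    static bool isF() { return f.load(); }
};
std::atomic<bool> A::f{false};

void threadWriter() {
    /* Do something */
    A::f.store(true);   // defined behaviour, visible to all readers
}

void threadReader() {
    while (!A::isF()) { /* Do something */ }
}
```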
But some people told me that the fact that f is static would help.
Frankly, this sounds like cargo-cult. I can imagine that this was confused with the fact that initialization of static local variables is thread safe. From cppreference:
If multiple threads attempt to initialize the same static local variable concurrently, the initialization occurs exactly once (similar behavior can be obtained for arbitrary functions with std::call_once). Note: usual implementations of this feature use variants of the double-checked locking pattern, which reduces runtime overhead for already-initialized local statics to a single non-atomic boolean comparison.
Look for Meyers singleton to see an example of that. Though, this is merely about initialization. For example here:
int& foo() {
    static int x = 42;
    return x;
}
Two threads can call this function concurrently and x will be initialized exactly once. This has no impact on thread-safety of x itself. If two threads call foo and one writes and another reads x there is a data race. However, this is about initialization of static local variables and has nothing to do with your example. I don't know what they meant when they told you static would "help".

Boolean stop signal between threads

What is the simplest way to signal a background thread to stop executing?
I have used something like:
volatile bool Global_Stop = false;

void do_stuff() {
    while (!Global_Stop) {
        //...
    }
}
Is there anything wrong with this? I know for complex cases you might need "atomic" or mutexes, but for just boolean signaling this should work, right?
std::atomic is not for "complex cases". It is for when you need to access something from multiple threads. There are many myths about volatile; all that matters here is that volatile does not help when you need to access something from different threads. You need a std::atomic&lt;bool&gt;. Whether accessing a bool is atomic on your actual hardware does not really matter, because as far as C++ is concerned it is not.
Yes there's a problem: that's not guaranteed to work in C++. But it's very simple to fix, so long as you're on at least C++11: use std::atomic<bool> instead, like this:
#include <atomic>

std::atomic<bool> Global_Stop{false}; // note: copy-init from bool only compiles from C++17

void do_stuff() {
    while (!Global_Stop) {
        //...
    }
}
One problem is that the compiler is allowed to reorder memory accesses, so long as it can prove that it won't change the effect of the program:
int foo() {
    int i = 1;
    int j = 2;
    ++i;
    ++j;
    return i + j;
}
Here the compiler is allowed to increment j before i because it clearly won't change the effect of the program. In fact it can optimise the whole thing away into return 5;. So what counts as "won't change the effect of the program"? The answer is long and complex and I don't pretend to understand it all, but one part of it is that the compiler only has to worry about threads in certain contexts. If i and j were global variables instead of local variables, it could still reverse ++i and ++j, because it's allowed to assume there's only one thread accessing them unless you use certain thread primitives (such as a mutex).
Now when it comes to code like this:
while (!Global_Stop) {
    //...
}

If it can prove the code hidden in the comment doesn't touch Global_Stop, and there are no thread primitives such as a mutex, it can happily optimise it to:

if (!Global_Stop) {
    while (true) {
        //...
    }
}
If it can prove that Global_Stop is false at the start then it can even remove the if check!
Actually things are even worse than this, at least in theory. If a thread is in the process of writing to a variable when another thread accesses it, then only part of that write might be observed, giving you a totally different value (e.g. you update i from 3 to 4 and the other thread reads 7). Admittedly that is unlikely with a bool. But the standard is even broader than this: the situation is undefined behaviour, so it could even crash your program or have some other weird, unexpected behaviour.
Yes, this will most likely work, but only "by accident". As #idclev463035818 already wrote correctly:
std::atomic is not for "complex cases". It is for when you need to access something from multiple threads.
So in this case you should use atomic&lt;bool&gt; instead of volatile. The fact that volatile has been part of the language since long before the introduction of threads in C++11 should already be a strong indication that volatile was never designed or intended to be used for multi-threading. It is important to note that in C++ volatile is something fundamentally different from volatile in languages like Java or C#, where volatile is in fact related to the memory model. In these languages a volatile variable is much like an atomic in C++.
In C++, volatile is used for what is often referred to as "unusual memory", where memory can be read or modified outside the current process (for example when using memory mapped I/O). volatile forces the compiler to execute all operations in the exact order as specified. This prevents some optimizations that would be perfectly legal for atomics, while also allowing some optimizations that are actually illegal for atomics. For example:
volatile int x;
int y;
volatile int z;
x = 1;
y = 2;
z = 3;
z = 4;
...
int a = x;
int b = x;
int c = y;
int d = z;
In this example, there are two assignments to z, and two read operations on x. If x and z were atomics instead of volatile, the compiler would be free to see the first store as irrelevant and simply remove it. Likewise it could just reuse the value returned by the first load of x, effectively generate code like int b = a. But since x and z are volatile these optimizations are not possible. Instead, the compiler has to ensure that all volatile operations are executed in the exact order as specified, i.e., the volatile operations cannot be reordered with respect to each other. However, this does not prevent the compiler from reordering non-volatile operations. For example, the operations on y could freely be moved up or down - something that would not be possible if x and z were atomics. So if you were to try implementing a lock based on a volatile variable, the compiler could simply (and legally) move some code outside your critical section.
Last but not least it should be noted that marking a variable as volatile does not prevent it from participating in a data race. In those rare cases where you have some "unusual memory" (and therefore really require volatile) that is also accessed by multiple threads, you have to use volatile atomics.
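A sketch of what such a "volatile atomic" looks like in practice. std::atomic's member functions have volatile-qualified overloads, so the combination compiles as-is; hw_flag here is a hypothetical stand-in for a real device register, which in real code would live at a hardware-defined address:

```cpp
#include <atomic>

// volatile: the value may change outside the program's control.
// atomic:   the value is also shared between threads.
volatile std::atomic<int> hw_flag{0};

int read_flag() { return hw_flag.load(); }      // volatile-qualified overload
void set_flag(int v) { hw_flag.store(v); }      // likewise
```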

Implementation of Double Checked Locking in C++ 98/03 using volatile

Reading this article about Double Checked Locking Pattern in C++, I reached the place (page 10) where the authors demonstrate one of the attempts to implement DCLP "correctly" using volatile variables:
class Singleton {
public:
    static volatile Singleton* volatile instance();
private:
    static volatile Singleton* volatile pInstance;
};

// from the implementation file
volatile Singleton* volatile Singleton::pInstance = 0;

volatile Singleton* volatile Singleton::instance() {
    if (pInstance == 0) {
        Lock lock;
        if (pInstance == 0) {
            volatile Singleton* volatile temp = new Singleton;
            pInstance = temp;
        }
    }
    return pInstance;
}
After such example there is a text snippet that I don't understand:
First, the Standard’s constraints on observable behavior are only for
an abstract machine defined by the Standard, and that abstract machine
has no notion of multiple threads of execution. As a result, though
the Standard prevents compilers from reordering reads and writes to
volatile data within a thread, it imposes no constraints at all on
such reorderings across threads. At least that’s how most compiler
implementers interpret things. As a result, in practice, many
compilers may generate thread-unsafe code from the source above.
and later:
... C++’s abstract machine is single-threaded, and C++ compilers may
choose to generate thread-unsafe code from source like the above,
anyway.
These remarks are related to the execution on the uni-processor, so it's definitely not about cache-coherence issues.
If the compiler can't reorder reads and writes to volatile data within a thread, how can it reorder reads and writes across threads for this particular example thus generating thread-unsafe code?
The pointer to the Singleton may be volatile, but the data within the singleton is not.
Imagine Singleton has int x, y, z; as members, set to 15, 16, 17 in the constructor for some reason.
volatile Singleton* volatile temp = new Singleton;
pInstance = temp;
OK, temp is written before pInstance. When are x,y,z written relative to those? Before? After? You don't know. They aren't volatile, so they don't need to be ordered relative to the volatile ordering.
Now a thread comes in and sees:
if (pInstance == 0) { // first check
And let's say pInstance has been set, is not null.
What are the values of x,y,z? Even though new Singleton has been called, and the constructor has "run", you don't know whether the operations that set x,y,z have run or not.
So now your code goes and reads x,y,z and crashes, because it was really expecting 15,16,17, not random data.
Oh wait, pInstance is a volatile pointer to volatile data! So x,y,z is volatile right? Right? And thus ordered with pInstance and temp. Aha!
Almost. Any reads from *pInstance will be volatile, but the construction via new Singleton was not volatile. So the initial writes to x,y,z were not ordered. :-(
So you could, maybe, make the members volatile int x, y, z; OK. However...
C++ now has a memory model, even if it didn't when the article was written. Under the current rules, volatile does not prevent data races. volatile has nothing to do with threads. The program is UB. Cats and Dogs living together.
Also, although this is pushing the limits of the standard (i.e. it gets vague as to what volatile really means), an all-knowing, all-seeing, full-program-optimizing compiler could look at your uses of volatile and say "no, those volatiles don't actually connect to any I/O memory addresses etc., they really aren't observable behaviour, I'm just going to make them non-volatile"...
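Under the C++11 memory model, the pattern the article struggles with can actually be written correctly with an atomic pointer. A hedged sketch (the acquire/release pairing is the standard post-C++11 approach, not something from the article's pre-C++11 setting): a thread that sees a non-null pointer via the acquire load also sees the fully constructed object published by the release store.

```cpp
#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton* instance() {
        Singleton* p = pInstance.load(std::memory_order_acquire);
        if (p == nullptr) {                         // first check (no lock)
            std::lock_guard<std::mutex> lock(mtx);
            p = pInstance.load(std::memory_order_relaxed); // lock orders this
            if (p == nullptr) {                     // second check (locked)
                p = new Singleton;
                // release: construction happens-before any acquire load
                pInstance.store(p, std::memory_order_release);
            }
        }
        return p;
    }
private:
    static std::atomic<Singleton*> pInstance;
    static std::mutex mtx;
};
std::atomic<Singleton*> Singleton::pInstance{nullptr};
std::mutex Singleton::mtx;
```

In practice a Meyers singleton (a function-local static) gets you the same guarantee with far less code, since C++11 makes static local initialization thread-safe.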
I think they're referring to the cache coherency problem discussed in section 6 ("DCLP on Multiprocessor Machines"). With a multiprocessor system, the processor/cache hardware may write out the value for pInstance before the values are written out for the allocated Singleton. This can cause a second CPU to see the non-NULL pInstance before it can see the data it points to.
This requires a hardware fence instruction to ensure all the memory is updated before other CPUs in the system can see any of it.
If I'm understanding correctly they are saying that in the context of a single-thread abstract machine the compiler may simply transform:
volatile Singleton* volatile temp = new Singleton;
pInstance = temp;
Into:
pInstance = new Singleton;
Because the observable behavior is unchanged. Then this brings us back to the original problem with double checked locking.

Must I call atomic load/store explicitly?

C++11 introduced the std::atomic<> template library. The standard specifies the store() and load() operations to atomically set / get a variable shared by more than one thread.
My question is are assignment and access operations also atomic?
Namely, is:
std::atomic<bool> stop(false);
...

void thread_1_run_until_stopped()
{
    if (!stop.load())
        /* do stuff */
}

void thread_2_set_stop()
{
    stop.store(true);
}
Equivalent to:
void thread_1_run_until_stopped()
{
    if (!stop)
        /* do stuff */
}

void thread_2_set_stop()
{
    stop = true;
}
Are assignment and access operations for non-reference types also atomic?
Yes, they are. atomic<T>::operator T and atomic<T>::operator= are equivalent to atomic<T>::load and atomic<T>::store respectively. All the operators are implemented in the atomic class such that they will use atomic operations as you would expect.
I'm not sure what you mean about "non-reference" types? Not sure how reference types are relevant here.
You can do both, but the advantage of load()/store() is that they allow you to specify a memory order. That is sometimes important for performance, where you can specify std::memory_order_relaxed, while atomic&lt;T&gt;::operator T and atomic&lt;T&gt;::operator= use the safest and slowest std::memory_order_seq_cst. Sometimes it is important for correctness and readability of your code: although the default std::memory_order_seq_cst is the safest and thus most likely to be correct, it is not immediately clear to the reader what kind of operation (acquire/release/consume) you are doing, or whether you are doing such an operation at all (to answer: isn't relaxed order sufficient here?).
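To make the trade-off concrete, here is a sketch of the stop-flag example written with explicit relaxed ordering. The hedge: relaxed is only enough on the assumption that the reader depends solely on the flag itself, not on other data written by the stopping thread (for the latter you would need release/acquire):

```cpp
#include <atomic>

std::atomic<bool> stop{false};

// Relaxed: atomicity and eventual visibility, but no ordering of other
// memory operations around the flag. Fine for a pure one-way stop signal.
bool should_run() { return !stop.load(std::memory_order_relaxed); }
void request_stop() { stop.store(true, std::memory_order_relaxed); }
```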

Is struct assignment atomic in C/C++?

I am writing a program which has one process reading and writing to a shared memory and another process only reading it. In the shared memory there is a struct like this:
struct A {
    int a;
    int b;
    double c;
};
what I expect is to read the struct at once because while I am reading, the other process might be modifying the content of the struct. This can be achieved if the struct assignment is atomic, that is not interrupted. Like this:
struct A r = shared_struct;
So, is struct assignment atomic in C/C++? I tried searching the web but cannot find helpful answers. Can anyone help?
Thank you.
No, neither the C nor the C++ standard guarantees that assignment operations are atomic. You need some implementation-specific stuff for that - either something in the compiler or in the operating system.
C and C++ support atomic types in their current standards.
C++11 introduced support for atomic types. Likewise C11 introduced atomics.
Do you need to atomically snapshot all the struct members? Or do you just need shared read/write access to separate members separately? The latter is a lot easier, see below.
C11 stdatomic and C++11 std::atomic provide syntax for atomic objects of arbitrary sizes, but if they're larger than 8 or 16 bytes they won't be lock-free on typical systems (i.e. atomic load, store, exchange or CAS will be implemented by taking a hidden lock and then copying the whole struct).
If you just want a couple members, it's probably better to use a lock yourself and then access the members, instead of getting the compiler to copy the whole struct. (Current compilers aren't good at optimizing weird uses of atomics like that).
Or add a level of indirection, so there's a pointer which can easily be updated atomically to point to another struct with a different set of values. This is the building block for RCU (Read-Copy-Update) See also https://lwn.net/Articles/262464/. There are good library implementations of RCU, so use one instead of rolling your own unless your use-case is a lot simpler than the general case. Figuring out when to free old copies of the struct is one of the hard parts, because you can't do that until the last reader thread is done with it. And the whole point of RCU is to make the read path as light-weight as possible...
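The pointer-indirection idea above can be sketched like this. To keep the sketch self-contained it deliberately leaks the old copies; safe reclamation is exactly the hard part that real RCU libraries solve, so treat this as an illustration of the publish/snapshot mechanics only:

```cpp
#include <atomic>

struct A { int a; int b; double c; };

// Readers never see a half-written struct: a new copy is fully built
// before the atomic pointer is swung to it.
std::atomic<const A*> current{nullptr};

void publish(int a, int b, double c) {
    const A* next = new A{a, b, c};               // build off to the side
    current.store(next, std::memory_order_release); // then publish atomically
    // NOTE: the previous copy is leaked here; real RCU reclaims it
    // once no reader can still hold a pointer to it.
}

A snapshot() {
    const A* p = current.load(std::memory_order_acquire);
    return *p;                                    // consistent copy
}
```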
Your struct is 16 bytes on most systems; just barely small enough that x86-64 can load or store the entire thing somewhat more efficiently than just taking a lock (but only with lock cmpxchg16b). Still, it's not totally silly to use C/C++ atomics for this.
common to both C++11 and C11:
struct A {
    int a;
    int b;
    double c;
};
In C11 use the _Atomic type qualifier to make an atomic type. It's a qualifier like const or volatile, so you can use it on just about anything.
#include <stdatomic.h>

_Atomic struct A shared_struct;

// atomically take a snapshot of the shared state and do something
double read_shared(void) {
    struct A tmp = shared_struct; // defaults to memory_order_seq_cst
    // or tmp = atomic_load_explicit(&shared_struct, memory_order_relaxed);
    //int t = shared_struct.a;    // UNDEFINED BEHAVIOUR
    // then do whatever you want with the copy; it's a normal struct
    if (tmp.a > tmp.b)
        tmp.c = -tmp.c;
    return tmp.c;
}

// or take tmp by value or pointer as a function arg
// static inline
void update_shared(int a, int b, double c) {
    struct A tmp = {a, b, c};
    //shared_struct = tmp;
    // If you just need atomicity, not ordering, relaxed is much faster
    // for small lock-free objects (no memory barrier)
    atomic_store_explicit(&shared_struct, tmp, memory_order_relaxed);
}
Note that accessing a single member of an _Atomic struct is undefined behaviour. It won't respect locking, and might not be atomic. So don't do int i = shared_struct.a; (C++11 won't compile that, but C11 will).
In C++11, it's nearly the same: use the std::atomic<T> template.
#include <atomic>

std::atomic<A> shared_struct;

// atomically take a snapshot of the shared state and do something
double read_shared(void) {
    A tmp = shared_struct; // defaults to memory_order_seq_cst
    // or A tmp = shared_struct.load(std::memory_order_relaxed);
    // or tmp = std::atomic_load_explicit(&shared_struct, std::memory_order_relaxed);
    //int t = shared_struct.a; // won't compile: no operator.() overload
    // then do whatever you want with the copy; it's a normal struct
    if (tmp.a > tmp.b)
        tmp.c = -tmp.c;
    return tmp.c;
}

void update_shared(int a, int b, double c) {
    A tmp{a, b, c};
    //shared_struct = tmp;
    // If you just need atomicity, not ordering, relaxed is much faster
    // for small lock-free objects (no memory barrier)
    shared_struct.store(tmp, std::memory_order_relaxed);
}
See it on the Godbolt compiler explorer
Or if you don't need to snapshot the entire struct, but instead just want each member to be separately atomic, then you can simply make each member an atomic type (like atomic_int, and _Atomic double or std::atomic&lt;double&gt;).
struct Amembers {
    atomic_int a, b;
#ifdef __cplusplus
    std::atomic<double> c;
#else
    _Atomic double c;
#endif
} shared_state;
// If these members are used separately, put them in separate cache lines
// instead of in the same struct, to avoid false-sharing cache-line ping-pong.
(Note that C11 stdatomic is not guaranteed to be compatible with C++11
std::atomic, so don't expect to be able to access the same struct from C or C++.)
In C++11, struct-assignment for a struct with atomic members won't compile, because std::atomic deletes its copy-constructor. (You're supposed to load std::atomic<T> shared into T tmp, like in the whole-struct example above.)
In C11, struct-assignment for a non-atomic struct with atomic members will compile but is not atomic. The C11 standard doesn't specifically point this out anywhere. The best I can find is: n1570: 6.5.16.1 Simple assignment:
In simple assignment (=), the value of the right operand is converted to the type of the
assignment expression and replaces the value stored in the object designated by the left
operand.
Since this doesn't say anything about special handling of atomic members, it must be assumed that it's like a memcpy of the object representations. (Except that it's allowed to not update the padding.)
In practice, it's easy to get gcc to generate asm for structs with an atomic member where it copies non-atomically. Especially with a large atomic member which is atomic but not lock-free.
No, it isn't.
Whether it happens to be atomic in practice is a property of the CPU architecture in relation to the memory layout of the struct.
You could use the "atomic pointer swap" solution, which can be made atomic and could be used in a lock-free scenario.
Be sure to mark the respective shared pointer (variables) as volatile if it is important that changes are seen by other threads "immediately". This, in real life (TM), is not enough to guarantee correct treatment by the compiler. Instead, program against atomic primitives/intrinsics directly when you want lock-free semantics (see comments and linked articles for background).
Of course, inversely, you'll have to make sure you take a deep copy at the relevant times in order to do processing on the reading side of this.
Now all of this quickly becomes highly complex in relation to memory management, and I suggest you scrutinize your design and ask yourself seriously whether all the (perceived?) performance benefits justify the risk. Why don't you opt for a simple (reader/writer) lock, or get your hands on a fancy shared-pointer implementation that is thread-safe?