Implementation of Double-Checked Locking in C++98/03 using volatile

Reading this article about the Double-Checked Locking Pattern in C++, I reached the place (page 10) where the authors demonstrate one of the attempts to implement DCLP "correctly" using volatile variables:
class Singleton {
public:
    static volatile Singleton* volatile instance();
private:
    static volatile Singleton* volatile pInstance;
};

// from the implementation file
volatile Singleton* volatile Singleton::pInstance = 0;

volatile Singleton* volatile Singleton::instance() {
    if (pInstance == 0) {
        Lock lock;
        if (pInstance == 0) {
            volatile Singleton* volatile temp = new Singleton;
            pInstance = temp;
        }
    }
    return pInstance;
}
After such example there is a text snippet that I don't understand:
First, the Standard’s constraints on observable behavior are only for
an abstract machine defined by the Standard, and that abstract machine
has no notion of multiple threads of execution. As a result, though
the Standard prevents compilers from reordering reads and writes to
volatile data within a thread, it imposes no constraints at all on
such reorderings across threads. At least that’s how most compiler
implementers interpret things. As a result, in practice, many
compilers may generate thread-unsafe code from the source above.
and later:
... C++’s abstract machine is single-threaded, and C++ compilers may
choose to generate thread-unsafe code from source like the above,
anyway.
These remarks relate to execution on a uniprocessor, so this is definitely not about cache-coherence issues.
If the compiler can't reorder reads and writes to volatile data within a thread, how can it reorder reads and writes across threads for this particular example thus generating thread-unsafe code?

The pointer to the Singleton may be volatile, but the data within the singleton is not.
Imagine Singleton has int x, y, z; as members, set to 15, 16, 17 in the constructor for some reason.
volatile Singleton* volatile temp = new Singleton;
pInstance = temp;
OK, temp is written before pInstance. When are x,y,z written relative to those? Before? After? You don't know. They aren't volatile, so they don't need to be ordered relative to the volatile ordering.
Now a thread comes in and sees:
if (pInstance == 0) { // first check
And let's say pInstance has been set, is not null.
What are the values of x,y,z? Even though new Singleton has been called, and the constructor has "run", you don't know whether the operations that set x,y,z have run or not.
So now your code goes and reads x,y,z and crashes, because it was really expecting 15,16,17, not random data.
Oh wait, pInstance is a volatile pointer to volatile data! So x,y,z is volatile right? Right? And thus ordered with pInstance and temp. Aha!
Almost. Any reads from *pInstance will be volatile, but the construction via new Singleton was not volatile. So the initial writes to x,y,z were not ordered. :-(
So you could, maybe, make the members volatile int x, y, z; OK. However...
C++ now has a memory model, even if it didn't when the article was written. Under the current rules, volatile does not prevent data races. volatile has nothing to do with threads. The program is UB. Cats and Dogs living together.
Also, although this is pushing the limits of the standard (i.e., it gets vague as to what volatile really means), an all-knowing, all-seeing, full-program-optimizing compiler could look at your uses of volatile and say "no, those volatiles don't actually connect to any I/O memory addresses etc., they really aren't observable behaviour, I'm just going to make them non-volatile"...
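For contrast, here is a minimal C++11 sketch of the same publication using std::atomic rather than volatile (the x, y, z members are the hypothetical ones from above; this is not code from the article):

#include <atomic>

struct Singleton {
    int x, y, z;
    Singleton() : x(15), y(16), z(17) {}
};

std::atomic<Singleton*> pInstance{nullptr};

void publish() {
    Singleton* temp = new Singleton;                  // writes x, y, z
    pInstance.store(temp, std::memory_order_release); // orders those writes
                                                      // before the publication
}

void reader() {
    Singleton* p = pInstance.load(std::memory_order_acquire);
    if (p) {
        // guaranteed to observe x == 15, y == 16, z == 17 here
    }
}

The release/acquire pair gives exactly the cross-thread ordering that volatile cannot: the constructor's non-atomic writes are guaranteed visible before the pointer is.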

I think they're referring to the cache-coherency problem discussed in section 6 ("DCLP on Multiprocessor Machines"). With a multiprocessor system, the processor/cache hardware may write out the value for pInstance before the values are written out for the allocated Singleton. This can cause a second CPU to see the non-NULL pInstance before it can see the data it points to.
This requires a hardware fence instruction to ensure all the memory is updated before other CPUs in the system can see any of it.
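In C++11 this is expressible portably with standalone fences, which map closely to the hardware fence instruction described; a minimal sketch (the Singleton stand-in and the function names are illustrative):

#include <atomic>

struct Singleton { int data = 0; }; // stand-in for the real class

std::atomic<Singleton*> pInstance{nullptr};

void writer() {
    Singleton* temp = new Singleton;                  // construct first
    std::atomic_thread_fence(std::memory_order_release);
    pInstance.store(temp, std::memory_order_relaxed); // then publish
}

void reader() {
    Singleton* p = pInstance.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    if (p) {
        // the members of *p are guaranteed visible here
    }
}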

If I'm understanding correctly, they are saying that in the context of a single-threaded abstract machine the compiler may simply transform:
volatile Singleton* volatile temp = new Singleton;
pInstance = temp;
Into:
pInstance = new Singleton;
Because the observable behavior is unchanged. Then this brings us back to the original problem with double checked locking.


Boolean stop signal between threads

What is the simplest way to signal a background thread to stop executing?
I have used something like:
volatile bool Global_Stop = false;

void do_stuff() {
    while (!Global_Stop) {
        //...
    }
}
Is there anything wrong with this? I know for complex cases you might need "atomic" or mutexes, but for just boolean signaling this should work, right?
std::atomic is not for "complex cases". It is for when you need to access something from multiple threads. There are some myths about volatile; I cannot recall them, because all I remember is that volatile does not help when you need to access something from different threads. You need a std::atomic<bool>. Whether accessing a bool is atomic on your actual hardware does not really matter, because as far as C++ is concerned it is not.
Yes there's a problem: that's not guaranteed to work in C++. But it's very simple to fix, so long as you're on at least C++11: use std::atomic<bool> instead, like this:
#include <atomic>

std::atomic<bool> Global_Stop{false}; // brace-init: copy-init from a bool is ill-formed before C++17

void do_stuff() {
    while (!Global_Stop) {
        //...
    }
}
One problem is that the compiler is allowed to reorder memory accesses, so long as it can prove that it won't change the effect of the program:
int foo() {
    int i = 1;
    int j = 2;
    ++i;
    ++j;
    return i + j;
}
Here the compiler is allowed to increment j before i because it clearly won't change the effect of the program. In fact it can optimise the whole thing away into return 5;. So what counts as "won't change the effect of the program?" The answer is long and complex and I don't pretend to understand it all, but one part of it is that the compiler only has to worry about threads in certain contexts. If i and j were global variables instead of local variables, it could still reverse ++i and ++j because it's allowed to assume there's only one thread accessing them unless you use certain thread primitives (such as a mutex).
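For instance, a sketch of how a mutex constrains this (assuming i and j are now globals):

#include <mutex>

int i = 1; // now globals
int j = 2;
std::mutex m;

void bump() {
    std::lock_guard<std::mutex> lock(m);
    ++i; // these two may still be reordered with each other,
    ++j; // but neither may be moved out of the locked region
}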
Now when it comes to code like this:
while (!Global_Stop) {
    //...
}
If it can prove the code hidden in the comment doesn't touch the Global_Stop, and there are no thread primitives such as a mutex, it can happily optimise it to:
if (!Global_Stop) {
    while (true) {
        //...
    }
}
If it can prove that Global_Stop is false at the start then it can even remove the if check!
Actually things are even worse than this, at least in theory. You see, if a thread is in the process of writing to a variable when another thread accesses it, then only part of that write might be observed, giving you a totally different value (e.g. you update i from 3 to 4 and the other thread reads 7). Admittedly that is unlikely with a bool. But the standard is even broader than this: this situation is undefined behaviour, so it could even crash your program or have some other weird unexpected behaviour.
Yes, this will most likely work, but only "by accident". As #idclev463035818 already wrote correctly:
std::atomic is not for "complex cases". It is for when you need to access something from multiple threads.
So in this case you should use atomic<bool> instead of volatile. The fact that volatile has been part of the language long before the introduction of threads in C++11 should already be a strong indication that volatile was never designed or intended to be used for multi-threading. It is important to note that in C++ volatile is something fundamentally different from volatile in languages like Java or C#, where volatile is in fact related to the memory model. In those languages a volatile variable is much like an atomic in C++.
In C++, volatile is used for what is often referred to as "unusual memory", where memory can be read or modified outside the current process (for example when using memory-mapped I/O). volatile forces the compiler to execute all volatile operations in the exact order specified. This prevents some optimizations that would be perfectly legal for atomics, while also allowing some optimizations that are actually illegal for atomics. For example:
volatile int x;
int y;
volatile int z;
x = 1;
y = 2;
z = 3;
z = 4;
...
int a = x;
int b = x;
int c = y;
int d = z;
In this example, there are two assignments to z and two read operations on x. If x and z were atomics instead of volatile, the compiler would be free to see the first store as irrelevant and simply remove it. Likewise, it could just reuse the value returned by the first load of x, effectively generating code like int b = a. But since x and z are volatile, these optimizations are not possible. Instead, the compiler has to ensure that all volatile operations are executed in the exact order specified, i.e., the volatile operations cannot be reordered with respect to each other. However, this does not prevent the compiler from reordering non-volatile operations. For example, the operations on y could freely be moved up or down - something that would not be possible if x and z were atomics. So if you were to try implementing a lock based on a volatile variable, the compiler could simply (and legally) move some code outside your critical section.
Last but not least it should be noted that marking a variable as volatile does not prevent it from participating in a data race. In those rare cases where you have some "unusual memory" (and therefore really require volatile) that is also accessed by multiple threads, you have to use volatile atomics.
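For illustration, a sketch of that last case (the status register here is hypothetical):

#include <atomic>

// A device status word the hardware may change at any time (hence
// volatile) that is also read by several threads (hence atomic).
volatile std::atomic<int> status{0};

void wait_for_device() {
    while (status.load(std::memory_order_acquire) == 0) {
        // spin; the load can be neither elided nor hoisted out of the loop
    }
}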

Simple usage of std::atomic for sharing data between two threads

I have two threads that share a common variable.
The code structure is basically this (very simplified pseudo code):
static volatile bool commondata;

void Thread1()
{
    ...
    commondata = true;
    ...
}

void Thread2()
{
    ...
    while (!commondata)
    {
        ...
    }
    ...
}
Both threads run and at some point Thread1 sets commondata to true. The while loop in Thread2 should then stop. The important thing here is that Thread2 "sees" the change made to commondata by Thread1.
I know that the naive method using a volatile variable is not correct and is not guaranteed to work on every platform.
Is it good enough to replace volatile bool commondata with std::atomic<bool> commondata?
Simple answer: yes! :)
All operations on atomics are data race free and by default sequentially consistent.
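So the fix is essentially a drop-in replacement; a sketch of the structure from the question (using the default sequentially consistent operations):

#include <atomic>

static std::atomic<bool> commondata{false};

void Thread1()
{
    // ...
    commondata = true; // sequentially consistent store by default
    // ...
}

void Thread2()
{
    // ...
    while (!commondata) // sequentially consistent load by default
    {
        // ...
    }
    // ...
}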
There is a nice caveat here. While the general answer is "yes", std::atomic does not, by itself, make the variable volatile. Which means that if the compiler can (by some unfathomable means) infer that the variable did not change, it is allowed not to re-read it, since it may assume reading has no side effects.
If you check, there are overloads for both volatile and non-volatile versions of the class: http://eel.is/c++draft/atomics.types.generic
That could become important if atomic variable lives in shared memory, for example.

Is implementation of double checked singleton thread-safe?

I know that the common implementation of thread-safe singleton looks like this:
Singleton* Singleton::instance() {
    if (pInstance == 0) {
        Lock lock;
        if (pInstance == 0) {
            Singleton* temp = new Singleton; // initialize to temp
            pInstance = temp; // assign temp to pInstance
        }
    }
    return pInstance;
}
But why do they say that it is a thread-safe implementation?
For example, the first thread can pass both tests on pInstance == 0, create new Singleton and assign it to the temp pointer and then start assignment pInstance = temp (as far as I know, the pointer assignment operation is not atomic).
At the same time, the second thread tests the first pInstance == 0 while pInstance is only half-assigned: it's not nullptr, but not a valid pointer either, and that value is then returned from the function.
Can such a situation happen? I didn't find the answer anywhere, and it seems to be considered a quite correct implementation, so perhaps I'm not understanding something.
It's not safe by C++ concurrency rules, since the first read of pInstance is not protected by a lock or something similar and thus does not synchronise correctly with the write (which is protected). There is therefore a data race and thus Undefined Behaviour. One of the possible results of this UB is precisely what you've identified: the first check reading a garbage value of pInstance which is just being written by a different thread.
The common explanation is that it saves on acquiring the lock (a potentially time-expensive operation) in the more common case (pInstance already valid). However, it's not safe.
Since C++11 (and beyond) guarantees that initialisation of function-scope static variables happens only once and is thread-safe, the best way to create a singleton in C++ is to have a static local in the function:
Singleton& Singleton::instance() {
    static Singleton s;
    return s;
}
Note that there's no need for either dynamic allocation or a pointer return type.
As Voo mentioned in comments, the above assumes pInstance is a raw pointer. If it were std::atomic<Singleton*> instead, the code would work just fine as intended. Of course, it's then a question whether an atomic read is that much slower than obtaining the lock, which should be answered via profiling. Still, it would be a rather pointless exercise, since the static local variable is better in all but very obscure cases.
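For reference, here is a sketch of that atomic variant anyway (the mutex member and the particular memory orders are illustrative, not code from the question):

#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton* instance();
private:
    static std::atomic<Singleton*> pInstance;
    static std::mutex mtx;
};

std::atomic<Singleton*> Singleton::pInstance{nullptr};
std::mutex Singleton::mtx;

Singleton* Singleton::instance() {
    Singleton* p = pInstance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(mtx);
        p = pInstance.load(std::memory_order_relaxed); // re-check under the lock
        if (p == nullptr) {
            p = new Singleton;
            pInstance.store(p, std::memory_order_release);
        }
    }
    return p;
}

The acquire load on the fast path pairs with the release store, so a thread that sees a non-null pointer also sees a fully constructed Singleton.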

volatile variable being optimized away?

I have a background thread which loops on a state variable done. When I want to stop the thread I set the variable done to true. But apparently this variable is never set. I understand that the compiler might optimize it away so I have marked done volatile. But that seems not to have any effect. Note, I am not worried about race conditions so I have not made it atomic or used any synchronization constructs. How do I get the thread to not skip testing the variable at every iteration? Or is the problem something else entirely? done is initially false.
struct SomeObject
{
    volatile bool done_;
    void DoRun();
};

static void RunLoop(void* arg)
{
    if (!arg)
        return;
    SomeObject* thiz = static_cast<SomeObject*>(arg);
    while (!(thiz->done_)) {
        thiz->DoRun();
    }
    return;
}
volatile doesn't have any multi-threaded meaning in C++. It is a holdover from C, used as a modifier for sig_atomic_t flags touched by signal handlers and for access to memory-mapped devices. There is no language-mandated compulsion for a C++ function to re-access memory, which leads to the race condition (the reader never bothering to check again, as an "optimization") that others note above.
Use std::atomic (from C++11 or newer).
It can be, and usually is, lock-free:
#include <atomic>
#include <iostream>

struct SomeObject {
    std::atomic_bool done_{false};
    void DoRun();
    bool IsDone() { return done_.load(); }
    void KillMe() { done_.store(true); }
};

static void RunLoop(void *arg) {
    SomeObject &t = *static_cast<SomeObject *>(arg); // cast first; a void* cannot be dereferenced
    std::cout << t.done_.is_lock_free(); // some archaic platforms may report false
    while (!t.IsDone()) {
        t.DoRun();
    }
}
The load() and store() methods force the compiler to, at the least, check the memory location at every iteration. For x86[_64], the cache line for the SomeObject instance's done_ member will be cached and checked locally with no lock or even atomic/locked memory reads as-is. If you were doing something more complicated than a one-way flag set, you'd need to consider using something like explicit memory fences, etc.
Pre-C++11 has no multi-threaded memory model, so you will have to rely on a third-party library with special compiler privileges like pthreads or use compiler-specific functionality to get the equivalent.
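For example, a sketch of a pre-C++11 stop flag guarded by a pthreads mutex (assuming a POSIX platform; the lock provides both the ordering and the forced re-read):

#include <pthread.h>

static bool done_ = false;
static pthread_mutex_t done_mutex = PTHREAD_MUTEX_INITIALIZER;

static bool IsDone() {
    pthread_mutex_lock(&done_mutex);
    bool d = done_; // the lock orders this read against writers
    pthread_mutex_unlock(&done_mutex);
    return d;
}

static void KillMe() {
    pthread_mutex_lock(&done_mutex);
    done_ = true;
    pthread_mutex_unlock(&done_mutex);
}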
This works as you expect in MSVC 2010: if I remove the volatile, it loops forever; if I leave the volatile in, it works. This is because Microsoft treats volatile the way you expect, which is different from ISO C++.
This works too:
struct CDone {
    bool m_fDone;
};

int ThreadProc(volatile CDone *pDone) {
    while (!pDone->m_fDone) {
        // spin until another thread sets m_fDone through the volatile pointer
    }
    return 0;
}
Here is what MSDN says:
http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
Objects that are declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object when it is requested, even if a previous instruction asked for a value from the same object.
Also, the value of the object is written immediately on assignment.
ISO Compliant:
If you are familiar with the C# volatile keyword, or familiar with the behavior of volatile in earlier versions of Visual C++, be aware that the C++11 ISO Standard volatile keyword is different and is supported in Visual Studio when the /volatile:iso compiler option is specified. (For ARM, it's specified by default). The volatile keyword in C++11 ISO Standard code is to be used only for hardware access; do not use it for inter-thread communication. For inter-thread communication, use mechanisms such as std::atomic from the C++ Standard Template Library.

What is the difference between using explicit fences and std::atomic?

Assuming that aligned pointer loads and stores are naturally atomic on the target platform, what is the difference between this:
// Case 1: Dumb pointer, manual fence
int* ptr;
// ...
std::atomic_thread_fence(std::memory_order_release);
ptr = new int(-4);
this:
// Case 2: atomic var, automatic fence
std::atomic<int*> ptr;
// ...
ptr.store(new int(-4), std::memory_order_release);
and this:
// Case 3: atomic var, manual fence
std::atomic<int*> ptr;
// ...
std::atomic_thread_fence(std::memory_order_release);
ptr.store(new int(-4), std::memory_order_relaxed);
I was under the impression that they were all equivalent; however, Relacy detects a data race in the first case (only):
struct test_relacy_behaviour : public rl::test_suite<test_relacy_behaviour, 2>
{
    rl::var<std::string*> ptr;
    rl::var<int> data;

    void before()
    {
        ptr($) = nullptr;
        rl::atomic_thread_fence(rl::memory_order_seq_cst);
    }

    void thread(unsigned int id)
    {
        if (id == 0) {
            std::string* p = new std::string("Hello");
            data($) = 42;
            rl::atomic_thread_fence(rl::memory_order_release);
            ptr($) = p;
        }
        else {
            std::string* p2 = ptr($); // <-- Test fails here after the first thread completely finishes executing (no contention)
            rl::atomic_thread_fence(rl::memory_order_acquire);
            RL_ASSERT(!p2 || *p2 == "Hello" && data($) == 42);
        }
    }

    void after()
    {
        delete ptr($);
    }
};
I contacted the author of Relacy to find out if this was expected behaviour; he says that there is indeed a data race in my test case.
However, I'm having trouble spotting it; can someone point out to me what the race is?
Most importantly, what are the differences between these three cases?
Update: It's occurred to me that Relacy may simply be complaining about the atomicity (or lack thereof, rather) of the variable being accessed across threads... after all, it doesn't know that I intend only to use this code on platforms where aligned integer/pointer access is naturally atomic.
Another update: Jeff Preshing has written an excellent blog post explaining the difference between explicit fences and the built-in ones ("fences" vs "operations"). Cases 2 and 3 are apparently not equivalent! (In certain subtle circumstances, anyway.)
Although various answers cover bits and pieces of what the potential problem is and/or provide useful information, no answer correctly describes the potential issues for all three cases.
In order to synchronize memory operations between threads, release and acquire barriers are used to specify ordering.
Picture thread 1 performing some memory operations A followed by a release barrier and an atomic store, and thread 2 performing an atomic load followed by an acquire barrier and memory operations B (the original answer shows this as a diagram). Memory operations A in thread 1 cannot move down across the (one-way) release barrier (regardless of whether that is a release operation on an atomic store, or a standalone release fence followed by a relaxed atomic store). Hence memory operations A are guaranteed to happen before the atomic store. The same goes for memory operations B in thread 2, which cannot move up across the acquire barrier; hence the atomic load happens before memory operations B.

The atomic ptr itself provides inter-thread ordering based on the guarantee that it has a single modification order. As soon as thread 2 sees a value for ptr, it is guaranteed that the store (and thus memory operations A) happened before the load. Because the load is guaranteed to happen before memory operations B, the rules for transitivity say that memory operations A happen before B and synchronization is complete.
With that, let's look at your 3 cases.
Case 1 is broken because ptr, which is not an atomic object, is written by one thread and read by another without synchronization. That is a classical example of a data race and it causes undefined behavior.
Case 2 is correct. Because the integer allocation with new is an argument to the store, it is sequenced before the release operation. This is equivalent to:
// Case 2: atomic var, automatic fence
std::atomic<int*> ptr;
// ...
int *tmp = new int(-4);
ptr.store(tmp, std::memory_order_release);
Case 3 is broken, albeit in a subtle way. The problem is that even though the ptr assignment is correctly sequenced after the standalone fence, the integer allocation (new) is also sequenced after the fence, causing a data race on the integer memory location. The code is equivalent to:
// Case 3: atomic var, manual fence
std::atomic<int*> ptr;
// ...
std::atomic_thread_fence(std::memory_order_release);
int *tmp = new int(-4);
ptr.store(tmp, std::memory_order_relaxed);
In terms of the barrier picture above, the new operator is supposed to be part of memory operations A. Being sequenced after the release fence, the ordering guarantees no longer hold, and the integer allocation may actually be reordered with memory operations B in thread 2. Therefore, a load() in thread 2 may return garbage or cause other undefined behavior.
I believe the code has a race. Case 1 and case 2 are not equivalent.
29.8 [atomics.fences]
-2- A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
In case 1 your release fence does not synchronize with your acquire fence because ptr is not an atomic object and the store and load on ptr are not atomic operations.
Case 2 and case 3 are equivalent (actually, not quite, see LWimsey's comments and answer), because ptr is an atomic object and the store is an atomic operation. (Paragraphs 3 and 4 of [atomic.fences] describe how a fence synchronizes with an atomic operation and vice versa.)
The semantics of fences are defined only with respect to atomic objects and atomic operations. Whether your target platform and your implementation offer stronger guarantees (such as treating any pointer type as an atomic object) is implementation-defined at best.
N.B. For both case 2 and case 3, the acquire operation on ptr could happen before the store, and so would read garbage from the uninitialized atomic<int*>. Simply using acquire and release operations (or fences) doesn't ensure that the store happens before the load; it only ensures that if the load reads the stored value then the code is correctly synchronized.
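In other words, a reader generally has to loop until it actually observes the published value; a sketch (the initialization to nullptr is mine, to avoid reading an indeterminate atomic):

#include <atomic>

std::atomic<int*> ptr{nullptr};

int read_published() {
    int* p;
    // spin until the store is observed; the acquire load then
    // guarantees the pointee's initialization is visible
    while ((p = ptr.load(std::memory_order_acquire)) == nullptr) {
    }
    return *p; // safe: *p was written before the release store
}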
Several pertinent references:
the C++11 draft standard (PDF, see clauses 1, 29 and 30);
Hans-J. Boehm's overview of concurrency in C++;
McKenney, Boehm and Crowl on concurrency in C++;
GCC's developmental notes on concurrency in C++;
the Linux kernel's notes on concurrency;
a related question with answers here on Stack Overflow;
another related question with answers;
Cppmem, a sandbox in which to experiment with concurrency;
Cppmem's help page;
Spin, a tool for analyzing the logical consistency of concurrent systems;
an overview of memory barriers from a hardware perspective (PDF).
Some of the above may interest you and other readers.
The memory backing an atomic variable can only ever be used for the contents of the atomic. However, a plain variable, like ptr in case 1, is a different story. Once a compiler has the right to write to it, it can write anything to it, even a temporary value, if it runs out of registers.
Remember, your example is pathologically clean. Given a slightly more complex example:
std::string* p = new std::string("Hello");
data($) = 42;
rl::atomic_thread_fence(rl::memory_order_release);
std::string* p2 = new std::string("Bye");
ptr($) = p;
it is totally legal for the compiler to choose to reuse your pointer:
std::string* p = new std::string("Hello");
data($) = 42;
rl::atomic_thread_fence(rl::memory_order_release);
ptr($) = new std::string("Bye");
std::string* p2 = ptr($);
ptr($) = p;
Why would it do so? I don't know; perhaps some exotic trick to keep a cache line or something. The point is that, since ptr is not atomic in case 1, there is a race condition between the write ptr($) = p and the read std::string* p2 = ptr($), yielding undefined behavior. In this simple test case the compiler may not choose to exercise this right, and it may be safe, but in more complicated cases the compiler has the right to abuse ptr however it pleases, and Relacy catches this.
My favorite article on the topic: http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
The race in the first example is between the publication of the pointer and the stuff it points to. The reason is that the creation and initialization of the object come after the fence (i.e., on the same side as the publication of the pointer):
int* ptr; //noop
std::atomic_thread_fence(std::memory_order_release); //fence between noop and interesting stuff
ptr = new int(-4); //object creation, initialization, and publication
If we assume that CPU accesses to properly aligned pointers are atomic, the code can be corrected by writing this:
int* ptr; //noop
int* newPtr = new int(-4); //object creation & initialization
std::atomic_thread_fence(std::memory_order_release); //fence between initialization and publication
ptr = newPtr; //publication
Note that even though this may work fine on many machines, there is absolutely no guarantee within the C++ standard on the atomicity of the last line. So it is better to use atomic<> variables in the first place.