Proper compiler intrinsics for double-checked locking?

When implementing double-checked locking for initialization, what is the proper way to issue the memory and/or compiler barriers?
Something like std::call_once isn't what I want; it's way too slow. It's typically implemented on top of pthread_mutex_lock or EnterCriticalSection, depending on the OS.
In my programs, I often run into initialization cases where the initialization is safe to repeat, as long as exactly one thread gets to set the final pointer. If another thread beats it to setting the final pointer to the singleton object, it deletes what it created and makes use of the other thread's. I also often use this in cases where it doesn't matter which thread "wins" because they all come up with the same result.
Here's an unsafe, overly-contrived example, using Visual C++ intrinsics:
MyClass *GetGlobalMyClass()
{
    static MyClass *const UNSET_POINTER = reinterpret_cast<MyClass *>(
        static_cast<intptr_t>(-1));
    static MyClass *volatile s_object = UNSET_POINTER;

    if (s_object == UNSET_POINTER)
    {
        MyClass *newObject = MyClass::Create();
        if (_InterlockedCompareExchangePointer(&s_object, newObject,
                                               UNSET_POINTER) != UNSET_POINTER)
        {
            // Another thread beat us. If Create didn't return null, destroy.
            if (newObject)
            {
                newObject->Destroy(); // calls "delete this;", presumably
            }
        }
    }
    return s_object;
}
On a weakly-ordered memory architecture, my understanding is that it's possible for the new value of s_object to become visible to other threads before the variables written inside MyClass::Create or MyClass::MyClass are visible. Also, the compiler itself could reorder the code this way in the absence of a compiler barrier (in Visual C++, _WriteBarrier; but _InterlockedCompareExchange acts as a barrier).
Do I need a store-fence intrinsic in there, or something similar, to ensure that MyClass's members are visible to all threads before s_object becomes something besides -1?

Fortunately, the rules in C++ are very simple:
If there is a data race, the behaviour is undefined.
In your code the data race is caused by the following read, which conflicts with the write operation in _InterlockedCompareExchangePointer:
if (s_object == UNSET_POINTER)
A thread-safe solution without blocking might look as follows. Note that on x86 a load operation with sequential consistency has basically no overhead compared to a regular load operation. If you care about other architectures, you can also use acquire/release ordering instead of sequential consistency.
static std::atomic<MyClass*> s_object{nullptr};

MyClass* GetGlobalMyClass()
{
    MyClass* o = s_object.load(std::memory_order_seq_cst);
    if (o == nullptr) {
        o = new MyClass{...};
        MyClass* expected = nullptr;
        if (!s_object.compare_exchange_strong(expected, o, std::memory_order_seq_cst)) {
            delete o;
            o = expected;
        }
    }
    return o;
}
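For reference, the acquire/release version of the same function might look like this (a sketch; it assumes MyClass is default-constructible, since the construction arguments above are elided):
MyClass* GetGlobalMyClassAcqRel()
{
    MyClass* o = s_object.load(std::memory_order_acquire);
    if (o == nullptr) {
        MyClass* created = new MyClass;
        MyClass* expected = nullptr;
        if (s_object.compare_exchange_strong(expected, created,
                std::memory_order_release,    // success: publish our initialisation
                std::memory_order_acquire)) { // failure: observe the winner's writes
            o = created;
        } else {
            delete created; // another thread won the race
            o = expected;
        }
    }
    return o;
}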

For a proper C++11 implementation, any function-local static variable will be constructed in a thread-safe fashion by the first thread that passes through its declaration (so-called "magic statics").
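Applied to the question's example, that reduces to a sketch like this (reusing MyClass::Create from the question):
MyClass *GetGlobalMyClass()
{
    // The first thread to reach this declaration runs Create(); any other
    // thread arriving concurrently blocks until the initialisation completes.
    static MyClass *const s_object = MyClass::Create();
    return s_object;
}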


Confusion about thread-safety

I am new to the world of concurrency but from what I have read I understand the program below to be undefined in its execution. If I understand correctly this is not threadsafe as I am concurrently reading/writing both the shared_ptr and the counter variable in non-atomic ways.
#include <string>
#include <memory>
#include <thread>
#include <chrono>
#include <iostream>
struct Inner {
    Inner() {
        t_ = std::thread([this]() {
            counter_ = 0;
            running_ = true;
            while (running_) {
                counter_++;
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        });
    }

    ~Inner() {
        running_ = false;
        if (t_.joinable()) {
            t_.join();
        }
    }

    std::uint64_t counter_;
    std::thread t_;
    bool running_;
};

struct Middle {
    Middle() {
        data_.reset(new Inner);
        t_ = std::thread([this]() {
            running_ = true;
            while (running_) {
                data_.reset(new Inner());
                std::this_thread::sleep_for(std::chrono::milliseconds(1000));
            }
        });
    }

    ~Middle() {
        running_ = false;
        if (t_.joinable()) {
            t_.join();
        }
    }

    std::uint64_t inner_data() {
        return data_->counter_;
    }

    std::shared_ptr<Inner> data_;
    std::thread t_;
    bool running_;
};

struct Outer {
    std::uint64_t data() {
        return middle_.inner_data();
    }

    Middle middle_;
};

int main() {
    Outer o;
    while (true) {
        std::cout << "Data: " << o.data() << std::endl;
    }
    return 0;
}
My confusion comes from this:
1. Is the access to data_->counter_ safe in Middle::inner_data?
2. If thread A has a member shared_ptr<T> sp and decides to update it while thread B does shared_ptr<T> sp = A::sp, will the copy and destruction be thread-safe? Or do I risk the copy failing because the object is in the process of being destroyed?
3. Under what circumstances (can I check this with some tool?) is undefined likely to mean std::terminate? I suspect something like the above happens in some of my production code, but I cannot be certain as I am confused about 1 and 2; this small program has been running for days since I wrote it and nothing happens.
Code can be checked here at https://godbolt.org/g/saHz94
Is the access to data_->counter safe in Middle::inner_data?
No; it's a race condition. According to the standard, it's undefined behavior anytime you allow unsynchronized access to the same variable from more than one thread, and at least one thread might possibly modify the variable.
As a practical matter, here are a couple of unwanted behaviors you might see:
The thread reading counter_ sees an "old" value (one that rarely or never updates), because without synchronization the compiler is free to keep the value cached in a register instead of re-reading it from memory. (Using std::atomic would avoid this problem, because the compiler would then know you intend this variable to be accessed in an unsynchronized manner, and would take precautions to prevent it.)
Thread A might read the address that the data_ shared_pointer points to and be just about to dereference the address and read from the Inner struct it points to, when Thread A gets kicked off the CPU by thread B. Thread B executes, and during Thread B's execution, the old Inner struct gets deleted and the data_ shared_pointer set to point to a new Inner struct. Then Thread A gets back onto the CPU again, but since Thread A already has the old pointer value in memory, it dereferences the old value rather than the new one and ends up reading from freed/invalid memory. Again, this is undefined behavior, so in principle anything could happen; in practice you're likely to see either no obvious misbehavior, or occasionally a wrong/garbage value, or possibly a crash, it depends.
If thread A has a member shared_ptr sp and decides to update it while thread B does shared_ptr sp = A::sp will the copy and destruction be threadsafe? Or do I risk the copy failing because the object is in the process of being destroyed.
If the two threads are each operating on their own shared_ptr instance (which may share ownership of the same T), the reference counting itself is thread-safe: the control block's counts are atomic. But if they access the same shared_ptr object and at least one of them retargets it, as in your example, that is a data race; you would need the std::atomic_load/std::atomic_store overloads for shared_ptr (or std::atomic<std::shared_ptr<T>> in C++20) to make it safe. And if you are modifying the state of the T objects themselves (i.e. the Inner object in your example), that is not thread-safe either, since you could have one thread reading from the object while another thread is writing to it (deleting the object can be seen as a special case of writing to it, in that it definitely changes the object's state).
Under what circumstances (can I check this with some tool?) is undefined likely to mean std::terminate?
When you hit undefined behavior, it's very much dependent on the details of your program, the compiler, the OS, and the hardware architecture what will happen. In principle, undefined behavior means anything (including the program running just as you intended!) can happen, but you can't rely on any particular behavior -- which is what makes undefined behavior so evil.
In particular, it's common for a multithreaded program with a race condition to run fine for hours/days/weeks and then one day the timing is just right and it crashes or computes an incorrect result. Race conditions can be really difficult to reproduce for that reason.
As for when terminate() might be called, terminate() would be called if the fault causes an error state that is detected by the runtime environment (i.e. it corrupts a data structure that the runtime environment does integrity checks on, such as, in some implementations, the heap's metadata). Whether or not that actually happens depends on how the heap was implemented (which varies from one OS and compiler to the next) and what sort of corruption the fault introduced.
Thread safety is a property of the interactions between threads, not an absolute in general.
You cannot read or write a variable while another thread writes the same variable unless there is synchronization between the other thread's write and your read or write. Doing so is undefined behavior.
Undefined can mean anything. Program crashes. Program reads impossible value. Program formats hard drive. Program emails your browser history to all of your contacts.
A common case for unsynchronized integer access is that the compiler optimizes multiple reads to a value into one and doesn't reload it, because it can prove there is no defined way that someone could have modified the value. Or, the CPU memory cache does the same thing, because you did not synchronize.
For the pointers, similar or worse problems can occur, including following dangling pointers, corrupting memory, crashes, etc.
There are now atomic operations you can perform on shared pointers (std::atomic_load, std::atomic_store and friends), as well as std::atomic<std::shared_ptr<T>> in C++20.
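For example, a sketch of how the question's Middle could publish new Inner objects through those free functions (every access to data_ must then go through them, and counter_ itself would still need its own synchronization):
// writer (in Middle's background thread), instead of data_.reset(new Inner()):
std::atomic_store(&data_, std::shared_ptr<Inner>(new Inner()));

// reader (in Middle::inner_data):
std::shared_ptr<Inner> snapshot = std::atomic_load(&data_);
return snapshot->counter_; // 'snapshot' keeps the Inner alive during the read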

Synchronizing method calls on shared object from multiple threads

I am thinking about how to implement a class that will contain private data that will be eventually be modified by multiple threads through method calls. For synchronization (using the Windows API), I am planning on using a CRITICAL_SECTION object since all the threads will spawn from the same process.
Given the following design, I have a few questions.
template <typename T> class Shareable
{
private:
    const LPCRITICAL_SECTION sync; // Can be read and used by multiple threads
    T *data;

public:
    Shareable(LPCRITICAL_SECTION cs, unsigned elems) : sync{cs}, data{new T[elems]} { }
    ~Shareable() { delete[] data; }

    void sharedModify(unsigned index, T &datum) //<-- Can this be validly called by multiple
                                                //    threads with synchronization being implicit?
    {
        EnterCriticalSection(sync);
        /*
            The critical section of code involving reads & writes to 'data'
        */
        LeaveCriticalSection(sync);
    }
};
// Somewhere else ...
DWORD WINAPI ThreadProc(LPVOID lpParameter)
{
    Shareable<ActualType> *ptr = static_cast<Shareable<ActualType>*>(lpParameter);
    ActualType copyable = /* initialization */;
    ptr->sharedModify(validIndex, copyable); //<-- OK, synchronized?
    return 0;
}
The way I see it, the API calls will be conducted in the context of the current thread. That is, I assume this is the same as if I had acquired the critical section object from the pointer and called the API from within ThreadProc(). However, I am worried that if the object is created and placed in the main/initial thread, there will be something funky about the API calls.
1. When sharedModify() is called on the same object concurrently, from multiple threads, will the synchronization be implicit, in the way I described it above?
2. Should I instead get a pointer to the critical section object and use that instead?
3. Is there some other synchronization mechanism that is better suited to this scenario?
When sharedModify() is called on the same object concurrently, from multiple threads, will the synchronization be implicit, in the way I described it above?
It's not implicit, it's explicit. There's only one CRITICAL_SECTION, and only one thread can hold it at a time.
Should I instead get a pointer to the critical section object and use that instead?
No. There's no reason to use a pointer here.
Is there some other synchronization mechanism that is better suited to this scenario?
It's hard to say without seeing more code, but this is definitely the "default" solution. It's like a singly-linked list -- you learn it first, it always works, but it's not always the best choice.
When sharedModify() is called on the same object concurrently, from multiple threads, will the synchronization be implicit, in the way I described it above?
Implicit from the caller's perspective, yes.
Should I instead get a pointer to the critical section object and use that instead?
No. In fact, I would suggest giving the Shareable object ownership of its own critical section instead of accepting one from the outside (and embracing RAII concepts to write safer code), eg:
template <typename T>
class Shareable
{
private:
    CRITICAL_SECTION sync;
    std::vector<T> data;

    struct SyncLocker
    {
        CRITICAL_SECTION &sync;
        SyncLocker(CRITICAL_SECTION &cs) : sync(cs) { EnterCriticalSection(&sync); }
        ~SyncLocker() { LeaveCriticalSection(&sync); }
    };

public:
    Shareable(unsigned elems) : data(elems)
    {
        InitializeCriticalSection(&sync);
    }

    Shareable(const Shareable&) = delete;
    Shareable(Shareable&&) = delete;

    ~Shareable()
    {
        {
            SyncLocker lock(sync);
            data.clear();
        }
        DeleteCriticalSection(&sync);
    }

    void sharedModify(unsigned index, const T &datum)
    {
        SyncLocker lock(sync);
        data[index] = datum;
    }

    Shareable& operator=(const Shareable&) = delete;
    Shareable& operator=(Shareable&&) = delete;
};
Is there some other synchronization mechanism that is better suited to this scenario?
That depends. Will multiple threads be accessing the same index at the same time? If not, then there is not really a need for the critical section at all. One thread can safely access one index while another thread accesses a different index.
If multiple threads need to access the same index at the same time, a critical section might still not be the best choice. Locking the entire array might be a big bottleneck if you only need to lock portions of the array at a time. Things like the Interlocked API, or Slim Read/Write locks, might make more sense. It really depends on your thread designs and what you are actually trying to protect.
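For example, a minimal sketch of the Slim Reader/Writer lock approach (Windows Vista and later; the array and function names here are illustrative, not from the question):
#include <windows.h>

SRWLOCK g_lock = SRWLOCK_INIT; // no explicit init/teardown call needed
int g_data[100];

int readElement(unsigned index)
{
    AcquireSRWLockShared(&g_lock);    // many readers may hold the lock at once
    int value = g_data[index];
    ReleaseSRWLockShared(&g_lock);
    return value;
}

void writeElement(unsigned index, int value)
{
    AcquireSRWLockExclusive(&g_lock); // writers get exclusive access
    g_data[index] = value;
    ReleaseSRWLockExclusive(&g_lock);
}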

C++: Thread Safety in a Signal/Slot Library

I'm implementing a Signal/Slot framework, and got to the point that I want it to be thread-safe. I already had a lot of support from the Boost mailing-list, but since this is not really boost-related, I'll ask my pending question here.
When is a signal/slot implementation (or any framework that calls functions outside itself, specified in some way by the user) considered thread-safe? Should it be safe w.r.t. its own data, i.e. the data associated to its implementation details? Or should it also take into account the user's data, which might or might not be modified whatever functions are passed to the framework?
This is an example given on the mailing-list (Edit: this is an example use-case --i.e. user code--. My code is behind the calls to the Emitter object):
int *somePtr = nullptr;
Emitter<Event> em; // just an object that can emit the 'Event' signal

void mainThread()
{
    em.connect<Event>(someFunction);
    // now, somehow, 2 threads are created which, at some point
    // execute the thread1() and thread2() functions below
}

void someFunction()
{
    // can somePtr change after the check but before the set?
    if (somePtr)
        *somePtr = 17;
}

void cleanupPtr()
{
    // this looks safe, but compilers and CPUs can reorder this code:
    int *tmp = somePtr;
    somePtr = nullptr;
    delete tmp;
}

void thread1()
{
    em.emit<Event>();
}

void thread2()
{
    em.disconnect<Event>(someFunction);
    // now safe to cleanup (?)
    cleanupPtr();
}
In the above code, it might happen that Event is emitted, causing someFunction to be executed. If somePtr is non-null, but becomes null just after the if, but before the assignment, we're in trouble. From the point of view of thread2, this is not obvious because it is disconnecting someFunction before calling cleanupPtr.
I can see why this could potentially lead to trouble, but whose responsibility is it? Should my library protect the user from using it in every irresponsible but imaginable way?
I suspect there is no clearly good answer, but clarity will come from documenting the guarantees you wish to make about concurrent access to an Emitter object.
One level of guarantee, which to me is what is implied by a promise of thread safety, is that:
Concurrent operations on the object are guaranteed to leave the object in a consistent state (at least, from the point of view of the accessing threads.)
Non-commutative operations will be performed as if they were scheduled serially in some (unknown) order.
Then the question is, what does the emit method promise semantically: passing control to the connected routine, or evaluation of the function? If the former, then your work sounds like it is already done; if the latter, then the 'as-if ordered' requirement would mean that you need to enforce some level of synchronisation.
Users of the library can work with either, provided it is clear what is being promised.
Firstly, the simplest possibility: if you don't claim your library to be thread-safe, you don't have to bother with this.
(But even) if you do:
In your example the user would have to take care of thread-safety, since both functions could be dangerous even without using your event system (IMHO, this is a pretty good way to determine who should take care of this kind of problem). A possible way to do this in C++11 could be:
#include <mutex>

// A mutex is used to control thread-access to a shared resource
std::mutex _somePtr_mutex;
int *somePtr = nullptr;

void someFunction()
{
    /*
        Create a 'lock_guard' to manage your mutex.
        Is the mutex '_somePtr_mutex' already locked?
            Yes: Wait until it's unlocked.
            No:  Lock it and continue execution.
    */
    std::lock_guard<std::mutex> lock(_somePtr_mutex);

    if (somePtr)
        *somePtr = 17;

    // End of scope: 'lock' gets destroyed and hence unlocks '_somePtr_mutex'
}

void cleanupPtr()
{
    /*
        Create a 'lock_guard' to manage your mutex.
        Is the mutex '_somePtr_mutex' already locked?
            Yes: Wait until it's unlocked.
            No:  Lock it and continue execution.
    */
    std::lock_guard<std::mutex> lock(_somePtr_mutex);

    int *tmp = somePtr;
    somePtr = nullptr;
    delete tmp;

    // End of scope: 'lock' gets destroyed and hence unlocks '_somePtr_mutex'
}
The last question is easy: if you say your library is thread-safe, it should be thread-safe. It makes no sense to say it is partly thread-safe, or that it is only thread-safe if you do not abuse it. In that case you have to explain exactly what is not thread-safe.
Now to your first question, regarding someFunction:
The operation is not atomic, which means the CPU can be interrupted between the if and the assignment. And that will happen, I know that :-) The other thread can erase the pointer at any time, even between two short and fast-looking statements.
Now to cleanupPtr:
The volatile keyword tells the compiler not to cache the value of somePtr in a register, so every read and write you wrote in the code actually happens in memory. However, in standard C++, volatile provides neither atomicity nor ordering guarantees between threads, so by itself it is not enough to synchronize a reader thread and a writer thread; for that you need a mutex or atomics.
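For the simple reader/writer-flag situation, std::atomic is the tool that does give those guarantees. A sketch (the names here are illustrative):
#include <atomic>

std::atomic<bool> dataReady{false};
int payload = 0;

void writerThread()
{
    payload = 17;                                     // plain write
    dataReady.store(true, std::memory_order_release); // publish it
}

void readerThread()
{
    if (dataReady.load(std::memory_order_acquire))    // synchronizes-with the store
    {
        // guaranteed to see payload == 17 here
    }
}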
For other situations you can use a mutex or atomics. I will give you an example with a mutex. I use C++11 for that, but it works similarly with previous versions of C++ using boost.
Using a mutex:
int *somePtr = nullptr;
Emitter<Event> em; // just an object that can emit the 'Event' signal
std::recursive_mutex g_mutex;

void mainThread()
{
    em.connect<Event>(someFunction);
    // now, somehow, 2 threads are created which, at some point
    // execute the thread1() and thread2() functions below
}

void someFunction()
{
    std::lock_guard<std::recursive_mutex> lock(g_mutex);
    // can somePtr change after the check but before the set?
    if (somePtr)
        *somePtr = 17;
}

void cleanupPtr()
{
    std::lock_guard<std::recursive_mutex> lock(g_mutex);
    // this looks safe, but compilers and CPUs can reorder this code:
    int *tmp = somePtr;
    somePtr = nullptr;
    delete tmp;
}

void thread1()
{
    em.emit<Event>();
}

void thread2()
{
    em.disconnect<Event>(someFunction);
    // now safe to cleanup (?)
    cleanupPtr();
}
I only added a recursive mutex here without changing any other code of the sample, even though it's now cargo-cult code.
There are two kinds of mutex in the std library: an (for this purpose) utterly useless std::mutex, and std::recursive_mutex, which works like you would expect a mutex to work. std::mutex excludes access on any further lock attempt, even from the same thread — which can happen if a method that needs mutex protection calls a public method that uses the same mutex. std::recursive_mutex is reentrant for the same thread.
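To make the reentrancy point concrete (a sketch; with a plain std::mutex the nested lock attempt from the same thread is undefined behaviour and typically deadlocks):
#include <mutex>

std::recursive_mutex g_m;

void helper()
{
    std::lock_guard<std::recursive_mutex> lock(g_m); // OK: this thread already holds g_m
    // ...
}

void api()
{
    std::lock_guard<std::recursive_mutex> lock(g_m);
    helper(); // would deadlock if g_m were a plain std::mutex
}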
Atomics (or interlocked operations in Win32) are another way, but only for exchanging values between threads or accessing them concurrently. Your example is missing such values, but in your case I would look a little deeper into them (std::atomic).
UPDATE
If you are the user of a library which is not explicitly declared thread-safe by its developer, treat it as not thread-safe and shield every call to it with a mutex lock.
To stick with the example: if you cannot change someFunction, then you have to wrap the function like:
void threadsafeSomeFunction()
{
    std::lock_guard<std::recursive_mutex> lock(g_mutex);
    someFunction();
}

Do I need to use volatile keyword if I declare a variable between mutexes and return it?

Let's say I have the following function.
std::mutex mutex;

int getNumber()
{
    mutex.lock();
    int size = someVector.size();
    mutex.unlock();
    return size;
}
Is this a place to use the volatile keyword when declaring size? Will return value optimization or something else break this code if I don't use volatile? The size of someVector can be changed from any of the numerous threads the program has, and it is assumed that only one thread (other than the modifiers) calls getNumber().
No. But beware that size may not reflect the actual size after the mutex is released.
Edit: If you need to do some work that relies on the size being correct, you will need to wrap that whole task with a mutex.
You haven't mentioned what the type of the mutex variable is, but assuming it is an std::mutex (or something similar meant to guarantee mutual exclusion), the compiler is prevented from performing a lot of optimizations. So you don't need to worry about return value optimization or some other optimization allowing the size() query from being performed outside of the mutex block.
However, as soon as the mutex lock is released, another waiting thread is free to access the vector and possibly mutate it, thus changing the size. Now, the number returned by your function is outdated. As Mats Petersson mentions in his answer, if this is an issue, then the mutex lock needs to be acquired by the caller of getNumber(), and held until the caller is done using the result. This will ensure that the vector's size does not change during the operation.
Explicitly calling mutex::lock followed by mutex::unlock quickly becomes unfeasible for more complicated functions involving exceptions, multiple return statements etc. A much easier alternative is to use std::lock_guard to acquire the mutex lock.
int getNumber()
{
    std::lock_guard<std::mutex> l(mutex); // lock is acquired
    int size = someVector.size();
    return size;
} // lock is released automatically when l goes out of scope
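And if, as noted above, the caller needs the size to stay valid while it works with it, the caller itself must hold the lock for the whole task. A sketch (the function name is illustrative):
void useSize()
{
    std::lock_guard<std::mutex> l(mutex); // caller holds the lock...
    std::size_t size = someVector.size();
    // ...so the vector cannot change while 'size' is being used here
}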
volatile is a keyword that you use to tell the compiler to actually perform every read and write of the variable and not to optimize any of them away. Here is an example:
int example_function() {
    int a;
    volatile int b;
    a = 1; // this is ignored because nothing reads it before it is assigned again
    a = 2; // same here
    a = 3; // this is the last one, so a write takes place
    b = 1; // b gets written here, because b is volatile
    b = 2; // and again
    b = 3; // and again
    return a + b;
}
What is the real use of this? I've seen it in delay functions (keep the CPU busy for a bit by making it count up to a number) and in systems where several threads might look at the same variable. It can sometimes help a bit with multi-threaded things, but it isn't really a threading thing and is certainly not a silver bullet

Portable thread-safe lazy singleton

Greetings to all.
I'm trying to write a thread-safe lazy singleton for future use. Here's the best I could come up with. Can anyone spot any problems with it? The key assumption is that static initialization occurs in a single thread before dynamic initialisation. (This will be used for a commercial project and the company is not using boost :(; life would be a breeze otherwise :)
PS: Haven't checked that this compiles yet, my apologies.
/*
There are two difficulties when implementing the singleton pattern:

Problem (a): The "global variable instantiation fiasco". TODO: URL
This is due to the unspecified order in which global variables are initialised. Static class
members are equivalent to a global variable in C++ during initialisation.

Problem (b): Multi-threading.
Care must be taken to ensure that the mutex initialisation is handled properly with respect to problem (a).
*/

/*
Things achieved, maybe:
*) Portable
*) Lazy creation.
*) Safe from unspecified order of global variable initialisation.
*) Thread-safe.
*) Mutex is properly initialised when invoked during global variable initialisation.
*) Effectively lock-free in instance().
*/
/************************************************************************************
    Platform dependent mutex implementation
*/
class Mutex
{
public:
    void lock();
    void unlock();
};
/************************************************************************************
    Threadsafe singleton
*/
class Singleton
{
public: // Interface
    static Singleton* Instance();

private: // Static helper functions
    static Mutex* getMutex();

private: // Static members
    static Singleton* _pInstance;
    static Mutex* _pMutex;

private: // Instance members
    bool* _pInstanceCreated; // This is here to convince myself that the compiler is not re-ordering instructions.
    void setInstanceCreated();

private: // Singletons can't be copied
    explicit Singleton( bool* pInstanceCreated );
    ~Singleton() { }
};
/************************************************************************************
    We can't use a static class member variable to initialise the mutex due to the unspecified
    order of initialisation of global variables.
    Calling this from the initialiser of the static member _pMutex (below) ensures the mutex
    exists before main() is entered.
*/
Mutex* Singleton::getMutex()
{
    static Mutex* pMutex = 0; // alternatively: static Mutex* pMutex = new Mutex();
    if( !pMutex )
    {
        pMutex = new Mutex(); // Constructor initialises the mutex: eg. pthread_mutex_init( ... )
    }
    return pMutex;
}
/************************************************************************************
This static member variable ensures that we call Singleton::getMutex() at least once before
the main entry point of the program so that the mutex is always initialised before any threads
are created.
*/
Mutex* Singleton::_pMutex = Singleton::getMutex();
/************************************************************************************
Keep track of the singleton object for possible deletion.
*/
Singleton* Singleton::_pInstance = Singleton::Instance();
/************************************************************************************
    Read the comments in Singleton::Instance().
*/
Singleton::Singleton( bool* pInstanceCreated )
{
    fprintf( stderr, "Constructor\n" );
    _pInstanceCreated = pInstanceCreated;
}

/************************************************************************************
    Read the comments in Singleton::Instance().
*/
void Singleton::setInstanceCreated()
{
    *_pInstanceCreated = true;
}
/************************************************************************************
    Fingers crossed.
*/
Singleton* Singleton::Instance()
{
    /*
        'instance' is initialised to zero the first time control flows over it, so this
        avoids the unspecified order of global variable initialisation problem.
    */
    static Singleton* instance = 0;

    /*
        When we do:
            instance = new Singleton( &instanceCreated );
        the compiler can reorder instructions any way it wants, as long as the observed
        behaviour is consistent with that of a single-threaded environment (assuming
        that no thread-safe compiler flags are specified). The following is thus not threadsafe:

            if( !instance )
            {
                lock();
                if( !instance )
                {
                    instance = new Singleton( &instanceCreated );
                }
                unlock();
            }

        Instead we use:
            static bool instanceCreated = false;
        as the initialisation indicator.
    */
    static bool instanceCreated = false;

    /*
        Double-checked pattern with a slight twist.
    */
    if( !instanceCreated )
    {
        getMutex()->lock();
        if( !instanceCreated )
        {
            /*
                The ctor keeps a persistent pointer to 'instanceCreated'.
                This is to convince ourselves of the correct order of initialisation
                (I think this is quite unnecessary).
            */
            instance = new Singleton( &instanceCreated );

            /*
                Set 'instanceCreated' to true through that pointer.
                Note that since setInstanceCreated() actually uses the non-static
                member variable '_pInstanceCreated', I can't see the compiler taking the
                liberty to call Singleton's ctor AFTER the following call. (I don't know
                much about compiler optimisation, but I doubt that it will break the ctor up into
                two functions and call one part of it before the following call and the other part after.)
            */
            instance->setInstanceCreated();

            /*
                The double-checked pattern should now work.
            */
        }
        getMutex()->unlock();
    }
    return instance;
}
No, this will not work. It is broken.
The problem has little/nothing to do with the compiler. It has to do with the order in which a second CPU will 'see' what the first CPU has done to memory. The memory (and caches) will be consistent, but the timing of WHEN each CPU decides to write or read each part of memory/cache is indeterminate.
So for CPU1:
instance = new Singleton( instanceCreated );
instance->setInstanceCreated();
Let's consider the compiler first. There is NO reason why the compiler wouldn't reorder or otherwise alter these functions. Maybe like:
temp_register = new Singleton(instanceCreated);
temp_register->setInstanceCreated();
instance = temp_register;
or many other possibilities - like you said as long as single-threaded observed behaviour is consistent. This DOES include things like " break up the ctor into two functions and call one part of it before the following call and the other part after."
Now, it probably wouldn't break it up into 2 calls, but it would INLINE the ctor, particularly since it is so small. Then, once inlined, everything may be reordered, as if the ctor was broken in 2, for example.
In general, I would say not only is it possible that the compiler reordered things, it is probable - ie for the code you have, there is probably a reordering (once inlined, and inlining is likely) that is 'better' than the order given by the C++ code.
But let's leave that aside, and try to understand the real issues of double-checked locking.
So, let's just assume the compiler didn't reorder anything. What about the CPU? Or more importantly CPUs - plural.
The first CPU, 'CPU1', needs to follow the instructions given by the compiler; in particular, it needs to write to memory the things it has been told to write:
instance,
instanceCreated,
the other member variables of the Singleton (ie your Singleton does DO something, and has some state, doesn't it?)
Actually, that 'other member variables' stuff is really important. Important for your singleton - that's its real purpose, right? - and important for our discussion. So let's give it a name: important_data. ie instance->important_data. And maybe instance->important_function(), which uses important_data. Etc.
As mentioned, let's assume the compiler has written the code such that these items are written in the order you are expecting, namely:
important_data - written inside the ctor, called from
instance = new Singleton(instanceCreated);
instance - assigned right after new/ctor returns
instanceCreated - inside setInstanceCreated()
Now, the CPU hands these writes off to the memory bus. Know what the memory bus does? IT REORDERS THEM. The CPU and architecture have the same constraints as the compiler - ie make sure this one CPU sees things consistently - ie single-threaded consistent. So if, for example, instance and instanceCreated are on the same cache-line (highly likely, actually), they might be written together, and since they were just read, that cache-line is 'hot', so maybe they get written FIRST before important_data, so that that cache-line can be retired to make room for the cache-line where important_data lives.
Did you see that? instanceCreated and instance were just committed to memory BEFORE important_data. Note that CPU1 doesn't care, because it is living in a single-threaded world...
So now introduce CPU2:
CPU2 comes in, sees instanceCreated == true and instance != NULL and thus goes off and decides to call Singleton::Instance()->important_function(), which uses important_data, which is uninitialized. CRASH BANG BOOM.
By the way, it gets worse. So far, we've seen that the compiler could reorder, but we're pretending it didn't. Let's go one step further and pretend that CPU1 did NOT reorder any of the memory writing. Are we OK now?
No. Of course not.
Just as CPU1 decided to optimize/reorder its memory writes, CPU2 can REORDER ITS READS!
CPU2 comes in and sees
if (!instanceCreated) ...
so it needs to read instanceCreated. Ever heard of 'speculative execution'? (Great name for a FPS game, by the way). If the memory bus isn't busy doing anything, CPU2 might pre-read some other values 'hoping' that instanceCreated is true. ie it may pre-read important_data for example. Maybe important_data (or the uninitialized, possibly re-claimed-by-the-allocator memory that will become important_data) is already in CPU2's cache. Or maybe (more likely?) CPU2 just free'd that memory, and the allocator wrote NULL in its first 4 bytes (allocators often use that memory for their free-lists), so actually, the memory soon-to-become important_data may actually still be in the write queue of CPU2. In that case, why would CPU2 bother re-reading that memory, when it hasn't even finished writing it yet!? (it wouldn't - it would just get the values from its write-queue.)
Did that make sense? If not, imagine that the value of instance (which is a pointer) is 0x17e823d0. What was that memory doing before it became (becomes) the Singleton? Is that memory still in the write-queue of CPU2?...
Or basically, don't even think about why it might want to do so, but realize that CPU2 might read important_data first, then instanceCreated second. So even though CPU1 may have wrote them in order CPU2 sees 'crap' in important_data, then sees true in instanceCreated (and who knows what in instance!). Again, CRASH BANG BOOM. Or BOOM CRASH BANG, since by now you realize that the order isn't guaranteed...
It's usually better to have a non-lazy singleton which does nothing in its constructor, and then in GetInstance do a thread-safe call once to a function which allocates any expensive resources. You're already creating a Mutex non-lazily, so why not just put the mutex and some kind of Pimpl in your Singleton object?
By the way, this is easier on Posix:
struct Singleton {
    static Singleton *GetInstance() {
        pthread_once(&control, doInit);
        return instance;
    }

private:
    static void doInit() {
        // slight problem: we can't throw from here, or fail
        try {
            instance = new Singleton();
        } catch (...) {
            // we could stash an error indicator in a static member,
            // and check it in GetInstance.
            std::abort();
        }
    }

    static pthread_once_t control;
    static Singleton *instance;
};

pthread_once_t Singleton::control = PTHREAD_ONCE_INIT;
Singleton *Singleton::instance = 0;
There do exist pthread_once implementations for Windows and other platforms.
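For instance, Windows Vista and later provide native one-time initialization; a sketch (it assumes the Singleton constructor is accessible to the callback):
#include <windows.h>

INIT_ONCE g_once = INIT_ONCE_STATIC_INIT;
Singleton *g_instance = nullptr;

BOOL CALLBACK InitSingleton(PINIT_ONCE, PVOID, PVOID *)
{
    g_instance = new Singleton();
    return TRUE; // returning FALSE would mark the initialization as failed
}

Singleton *GetInstance()
{
    InitOnceExecuteOnce(&g_once, InitSingleton, nullptr, nullptr);
    return g_instance;
}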
If you wish to see an in-depth discussion of Singletons, the various policies about their lifetime and the thread safety issues, I can only recommend a good read: "Modern C++ Design" by Alexandrescu.
The implementation is presented on the web in the Loki library.
And yes, it does hold in a single header file. So I would really encourage you to at least grab the file and read it, and better yet read the book to have the full-blown reflection.
At global scope in your code:
/************************************************************************************
Keep track of the singleton object for possible deletion.
*/
Singleton* Singleton::_pInstance = Singleton::Instance();
This makes your implementation not lazy. Presumably you want to set _pInstance to NULL at global scope, and assign to it after you construct the singleton inside Instance() before you unlock the mutex.
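That is, roughly:
Singleton* Singleton::_pInstance = NULL; // now lazy: assigned inside Instance(), under the mutex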
More food for thought from Meyers & Alexandrescu, with Singleton being the specific target: C++ and the Perils of Double-Checked Locking. It's a bit of a prickly problem.