Should a pointer to stack variable be volatile?


I know that I should use the volatile keyword to tell the compiler not to optimize memory read\write to variables. I also know that in most cases it should only be used to talk to non-C++ memory.
However, I would like to know if I have to use volatile when holding a pointer to some local (stack) variable.
For example:
// global or member variable
/* volatile? */ bool* p_stop;
void worker()
{
    /* volatile? */ bool stop = false;
    p_stop = &stop;
    while (!stop)
    {
        // Do some work
        // No usage of "stop" or "p_stop" here
    }
}
void stop_worker()
{
    *p_stop = true;
}
It looks to me like a compiler with some optimization level might see that stop is a local variable that is never changed, and could replace while(!stop) with while(true), so that changing *p_stop would do nothing.
So, is it required to mark the pointer as volatile in such a case?
P.S: Please do not lecture me on why not to use this, the real code that uses this hack does so for a (complex-to-explain) reason.
EDIT:
I failed to mention that these two functions run on different threads.
The worker() is a function of the first thread, and it should be stopped from another thread using the p_stop pointer.
I am not interested in knowing what better ways there are to solve the real reason that is behind this sort of hack. I simply want to know if this is defined\undefined behavior in C++ (11 for that matter), and also if this is compiler\platform\etc dependent. So far I see @Puppy saying that everyone is wrong and that this is wrong, but without referencing the specific part of the standard that says so.
I understand that some of you are offended by the "don't lecture me" part, but please stick to the real question - Should I use volatile or not? or is this UB? and if you can please help me (and others) learn something new by providing a complete answer.

I simply want to know if this is defined\undefined behavior in C++ (11 for that matter)
Ta-da (from N3337, "quasi C++11")
Two expression evaluations conflict if one of them modifies a memory location [..] and the other one accesses or modifies the same memory location.
§1.10/4
and:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior. [..]
§1.10/21
You're accessing the (memory location of) object stop from different threads, both accesses are not atomic, thus also in no "happens before" relation. Simply put, you have a data race and thus undefined behavior.
I am not interested in knowing what better ways there are to solve the real reason that is behind this sort of hack.
Atomic operations (as defined by the C++ standard) are the only way to (reliably) solve this.
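As a rough illustration of what "use atomics" could look like for the worker in the question, here is a minimal sketch. It swaps the pointer to a stack bool for a flag with static storage; the names stop_requested, iterations and run_worker_demo are made up for this example and are not from the original code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical replacement for the question's bool* p_stop: the flag itself
// is atomic and has static storage, so no pointer has to be published.
std::atomic<bool> stop_requested{false};
std::atomic<long> iterations{0};

void worker() {
    // Every iteration performs an atomic load, so there is no data race
    // with stop_worker() and the store must eventually be observed.
    while (!stop_requested.load(std::memory_order_acquire)) {
        iterations.fetch_add(1, std::memory_order_relaxed);  // "do some work"
    }
}

void stop_worker() {
    stop_requested.store(true, std::memory_order_release);
}

// Starts the worker on a second thread, asks it to stop, and waits for it.
bool run_worker_demo() {
    stop_requested.store(false);
    std::thread t(worker);
    stop_worker();
    t.join();  // returns only once the worker has seen the flag
    assert(!t.joinable());
    return stop_requested.load();
}
```

If the flag really has to live on the worker's stack, the pointer to it would itself need to be published safely (for example through a std::atomic<std::atomic<bool>*>), which this sketch deliberately avoids.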

So, is it required to mark the pointer as volatile in such a case?
No. It's not required, principally because volatile doesn't even remotely cover what you need it to do in this case. You must use an actual synchronization primitive, like an atomic operation or mutex. Using volatile here is undefined behaviour and your program will explode.
volatile is NOT useful for concurrency. It may be useful for implementing concurrent primitives but it is far from sufficient.
Frankly, whether or not you want to use actual synchronization primitives is irrelevant. If you want to write correct code, you have no choice.

P.S: Please do not lecture me on why not to use this,
I am not sure what we are supposed to say. The compiler manages the stack, so anything you are doing with it is technically undefined behavior and may not work when you upgrade to the next version of the compiler.
You are also making assumptions that may be different than the compiler's assumptions when it optimizes. This is the real reason to use (or not use) volatile; you give guidance to the compiler that helps it decide whether optimizations are safe. The use of volatile tells the compiler that it should assume that these variables may change due to external influences (other threads or special hardware behavior).
So yes, in this case, it looks like you would need to mark both p_stop and stop with a volatile qualifier.
(Note: this is necessary but not sufficient, as it does not cause the appropriate behaviors to happen in a language implementation with a relaxed memory model that requires barriers to ensure correctness. See https://en.wikipedia.org/wiki/Memory_ordering#Runtime_memory_ordering )

This question simply cannot be answered from the details provided.
As is stated in the question this is an entirely unsupported way of communicating between threads.
So the only answer is:
Specify the compiler versions you're using and hope someone knows its darkest secrets or refer to your documentation. All the C++ standard will tell you is this won't work and all anyone can tell you is "might work but don't".
There isn't a "oh, come on guys everyone knows it pretty much works what do I do as the workaround? wink wink" answer.
Unless your compiler doesn't support atomics or suitably concurrent mechanisms there is no justifiable reason for doing this.
"It's not supported" isn't "complex-to-explain" so I'd be fascinated based on that code fragment to understand what possible reason there is for not doing this properly (other than ancient compiler).

Related

Could optimization break thread safety?

#include <unistd.h> // usleep
class Foo{
public:
    void fetch(void)
    {
        int temp=-1;
        someSlowFunction(&temp);
        bar=temp;
    }
    int getBar(void)
    {
        return bar;
    }
    void someSlowFunction(int *ptr)
    {
        usleep(10000);
        *ptr=0;
    }
private:
    int bar;
};
I'm new to atomic operations so I may get some concepts wrong.
Considering above code, assuming loading and storing int type are atomic[Note 1], then getBar() could only get the bar before or after a fetch().
However, if a compiler is smart enough, it could optimize away temp and change it to:
void Foo::fetch(void)
{
    bar=-1;
    someSlowFunction(&bar);
}
Then in this case getBar() could get -1 or other intermediate state inside someSlowFunction() under certain timing conditions.
Is this risk possible? Does the standard prevent such optimizations?
Note 1: http://preshing.com/20130618/atomic-vs-non-atomic-operations/
The language standards have nothing to say about atomicity in this
case. Maybe integer assignment is atomic, maybe it isn’t. Since
non-atomic operations don’t make any guarantees, plain integer
assignment in C is non-atomic by definition.
In practice, we usually know more about our target platforms than
that. For example, it’s common knowledge that on all modern x86, x64,
Itanium, SPARC, ARM and PowerPC processors, plain 32-bit integer
assignment is atomic as long as the target variable is naturally
aligned. You can verify it by consulting your processor manual and/or
compiler documentation. In the games industry, I can tell you that a
lot of 32-bit integer assignments rely on this particular guarantee.
I'm targeting ARM Cortex-A8 here, so I consider this a safe assumption.

Compiler optimization can not break thread safety!
You might however experience issues with optimizations in code that appeared to be thread safe but really only worked because of pure luck.
If you access data from multiple threads, you must either
Protect the appropriate sections using std::mutex or the like.
or, use std::atomic.
If not, the compiler might perform optimizations that are next to impossible to anticipate.
I recommend watching CppCon 2014: Herb Sutter "Lock-Free Programming (or, Juggling Razor Blades), Part I" and Part II
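A minimal sketch of the std::mutex option applied to the class from the question. The class name LockedFoo, the member names and the initial value 7 are assumptions made for this example only.

```cpp
#include <cassert>
#include <mutex>

class LockedFoo {
public:
    void fetch() {
        int temp = -1;
        someSlowFunction(&temp);  // slow work happens outside the lock
        std::lock_guard<std::mutex> lock(mutex_);
        bar_ = temp;              // only the final value is ever published
    }
    int getBar() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return bar_;
    }
private:
    // Stand-in for the question's slow function (the sleep is omitted here).
    static void someSlowFunction(int* ptr) { *ptr = 0; }
    mutable std::mutex mutex_;
    int bar_ = 7;  // arbitrary initial value, chosen only for the demo
};
```

Because every access to bar_ happens under the same mutex, a reader sees either the old value or the new one, never the intermediate -1, regardless of what the optimizer does with temp.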

After answering question in comments, it makes more sense. Let's analyze thread-safety here given that fetch() and getBar() are called from different threads. Several points will need to be considered:
'Dirty reads', or garbage reads due to an interrupted write. While a general possibility, this does not happen for aligned ints on the 3 chip families I am familiar with. Let's discard this possibility for now, and just assume read values are always clean.
'Improper reads', or an option of reading something from bar which was never written there. Would it be possible? Optimizing away temp on the compiler part is, in my opinion, possible, but I am no expert in this matter. Let's assume it does not happen. The caveat would still be there - you might NEVER see the new value of bar. Not in a reasonable time, simply never.

The compiler can apply any transformation that results in the same observable behavior. Assignments to local non-volatile variables are not part of the observable behavior. The compiler may just decide to eliminate temp completely and just use bar directly. It may also decide that bar will always end up with the value zero, and set it at the beginning of the function (at least in your simplified example).
However, as you can read in James' answer on a related question, the situation is more complex because modern hardware also optimizes the executed code. This means that the CPU re-orders instructions, and neither the programmer nor the compiler has influence on that without using special instructions. You need either a std::atomic, explicit memory fences (I wouldn't recommend this because it is quite tricky), or a mutex, which also acts as a memory fence.
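For the std::atomic route, a sketch along these lines may help. The class name AtomicFoo and the initial value 42 are invented for the example; the sleep from the question is left out.

```cpp
#include <atomic>
#include <cassert>

class AtomicFoo {
public:
    void fetch() {
        int temp = -1;
        someSlowFunction(&temp);
        // A single atomic store publishes the result. The compiler may not
        // invent extra stores to bar_ (such as the intermediate -1), because
        // other threads could observe them.
        bar_.store(temp, std::memory_order_release);
    }
    int getBar() const { return bar_.load(std::memory_order_acquire); }
private:
    static void someSlowFunction(int* ptr) { *ptr = 0; }
    std::atomic<int> bar_{42};  // arbitrary initial value for the demo
};
```

The release/acquire pair also covers the hardware reordering mentioned above: a thread that reads the new value of bar_ sees everything fetch() wrote before the store.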

It probably wouldn't optimize that way because of the function call in the middle, but you can define temp as volatile, this will tell the compiler not to perform these kinds of optimizations.
Depending on the platform, you can certainly have cases where multibyte quantities are in an inconsistent state. It doesn't even need to be thread related. For example, a device experiencing low voltage during a power brown-out can leave memory in an inconsistent state. If you have pointers getting corrupted, then it's usually bad news.
One way I approached this on a system without mutexes was to ensure every piece of data could be verified. For example, for every datum T, there would be a validation checksum C and a backup U.
A set operation would be as follows:
U = T
T = new value
C = checksum(T)
And a get operation would be as follows:
is checksum(T) == C?
    yes: return T
    no: return U
This guarantees that whatever is returned is in a consistent state. I would apply this algorithm to the entire OS, so for example, entire files could be restored.
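The set/get steps above could be sketched in C++ roughly as follows. The XOR-based checksum and the CheckedValue type are placeholders invented for this example; a real system would likely use something stronger such as a CRC, and this sketch by itself does nothing about threads or store ordering.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical checksum, used only to keep the example short.
inline std::uint32_t checksum(std::uint32_t v) { return v ^ 0xA5A5A5A5u; }

struct CheckedValue {
    std::uint32_t value = 0;            // T
    std::uint32_t check = checksum(0);  // C
    std::uint32_t backup = 0;           // U

    void set(std::uint32_t v) {
        backup = value;       // U = T
        value = v;            // T = new value
        check = checksum(v);  // C = checksum(T)
    }
    std::uint32_t get() const {
        // Fall back to the backup when T no longer matches its checksum.
        return checksum(value) == check ? value : backup;
    }
};
```

For instance, after set(5) and set(7), get() returns 7; if value is then overwritten behind the structure's back, the checksum no longer matches and get() returns the backup, 5.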
If you want to ensure atomicity without getting into complex mutexes and whatever, try to use the smallest types possible. For example, does bar need to be an int or will unsigned char or bool suffice?

Has anyone ever considered a more "strict" flavor of C++ in which variables are / need to be initialized by default?

While C++ is an extraordinary language it's also a language with many pitfalls, especially for inexperienced programmers. I'm talking about things like uninitialized variables of primitive types in a class, e.g.
class Data {
public:
    std::string name;
    unsigned int version;
};
// ...
Data data;
if (data.version) { ... } // use of uninitialized member
I know this example is oversimplified but in practice even experienced developers sometimes forget to initialize their member variables in constructors. While leaving primitives uninitialized by default is probably a relic from C, it provides us the choice between performance (leave some data uninitialized) and correctness (initialize all data).
OK, but what if the logic was inverted? I mean, what if all primitives were either initialized with zeros or required explicit initialization, the lack of which would generate a compile error? Of course for full flexibility one would have a special syntax/type to leave a variable/member uninitialized, e.g.
unsigned int x = std::uninitialized_value;
or
Data::Data() : name(), version(std::uninitialized_value) {}
I understand this could cause problems with existing C++ code which allows uninitialized data, but the new code could be wrapped in a special block (extern "C" comes to my mind as an example) to let the compiler know a particular piece of code shall be strictly checked for uninitialized data.
Putting compatibility issues aside, such an approach would result in less bugs in our code, which is what we are all interested in.
Have you ever heard about any proposal like this?
Does such a proposal make sense at all?
Do you see any downsides of this approach?
Note 1: I used the term "strict" as this idea is related to the "strict mode" from JavaScript language which as mentioned on Mozilla Developer Network site
eliminates some JavaScript silent errors by changing them to throw
errors
Note 2: Please don't pay attention to the proposed syntax used in the proposal, it's there just to make a point.
Note 3: I'm aware of the fact that tools like cppcheck can easily find uninitialized member variables but my idea is about compile-time support for this kind of checks.

Use -Werror=uninitialized, it does exactly what you want.
You can then “uninitialize” a variable with unsigned int x = x;
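Independently of compiler flags, C++11 already offers two opt-in ways to get the "initialized by default" behavior for the Data example from the question. A small sketch (the RawData type and the two helper functions are made up for illustration):

```cpp
#include <cassert>
#include <string>

struct Data {
    std::string name;          // class type: default-constructed to ""
    unsigned int version = 0;  // default member initializer (C++11)
};

struct RawData {
    std::string name;
    unsigned int version;      // indeterminate after a plain `RawData d;`
};

// The default member initializer applies even without a constructor.
inline unsigned int version_of_default() {
    Data d;
    return d.version;
}

// Empty braces initialize every member, zeroing the built-in ones.
inline unsigned int version_of_value_init() {
    RawData d{};
    return d.version;
}
```

Both helpers return 0. Neither mechanism is enforced by the language, though, which is exactly the gap the question is about.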

Have you ever heard about any proposal like this?
Does such a proposal make sense at all?
Do you see any downsides of this approach?
I haven't heard of any such proposal. To answer the next two, I'll first broadly claim that neither makes sense. To more specifically answer, we need to break the proposal down into two separate proposals:
No uninitialized variables of any kind are allowed.
"Uninitialized" variables are secretly initialized to some well-defined value (e.g. as in Java, which uses false, 0, or null).
Case 1 has a simple counterexample: performance. Examples abound; here's a (simplified) pattern I use in my code:
Foo x;
if (/*something*/) {
    x = Foo(...);
} else {
    x = Foo(...);
}
//[do something with x]
x is "uninitialized" here. The default constructor does get called (if Foo is an object), but this might just do nothing. If it's a big structure, then all I really pay for is the cost of the assignment operator--not the cost of some nontrivial constructor plus the cost of the assignment operator. If it's a typedef for an int or something, I need to load it into cache before xor eax,eaxing or whatever to clear it. If there's more code in the if-blocks, that's potentially a valuable cache miss if the compiler can't elide it.
Case 2 is trickier.
It turns out that modern operating systems actually do change the value of memory when it is allocated to processes (it's a security thing). So, there is a rudimentary form of this happening anyway.
You mention in particular that such an approach would result in [fewer] bugs in our code. I fundamentally disagree. Why would magically initializing things to some well-defined value make our code more robust? In fact, this semi-automatic initialization causes an enormous quantity of errors.
Storytime!
Exactly how this works depends on how the program is compiled (so debug vs. release and compiler). When I was first learning C++, I wrote a software rasterizer and concluded it worked--since it worked perfectly in debug mode. But, the instant I switched to release mode, I got a completely different result. Had the OS not initialized everything to zero so consistently in debug mode, I might have realized this sooner. This is an example of a bug caused by what you suggest.
By some miracle I managed to re-find this depressing question, which demonstrates similar confusions. This is a widespread problem.
Nowadays, some debugging environments put debug values into memory. E.g. MSVC debug builds fill memory with patterns such as 0xCDCDCDCD (freshly allocated heap memory) and 0xFEEEFEEE (freed heap memory). Lots more here. This, in combination with some OS sorcery, allows them to find use of uninitialized values. Running your code in a VM (e.g. Valgrind) gives you the same effect for free.
The larger point here is that, in my opinion, automatically initializing something to a well-defined value when you forget to initialize it is just as bad as (if not worse than) getting some bogus value. The problem is that the programmer is expecting something when he has no justification to, not that the value he's expecting is well-defined.

Can adding 'const' to a pointer help the optimization?

I have a pointer int* p, and do some operations in a loop. I do not modify the memory, just read. If I add const to the pointer (both cases, const int* p, and int* const p), can it help a compiler to optimize the code?
I know other merits of const, like safety or self-documentation, I ask about this particular case. Rephrasing the question: can const give the compiler any useful (for optimization) information, ever?

While this is obviously specific to the implementation, it is hard to see how changing a pointer from int* to int const* could ever provide any additional information that the compiler would not otherwise have known.
In both cases, the value pointed to can change during the execution of the loop.
Therefore it probably will not help the compiler optimize the code.
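A tiny sketch of why the qualifier tells the optimizer so little: const on the pointee only restricts writes through that particular pointer. The function name read_write_read is invented for this example.

```cpp
#include <cassert>

// `p` only promises that this function will not write through p.
int read_write_read(const int* p, int* q) {
    int before = *p;
    *q += 1;         // may modify the very int that p points to
    int after = *p;  // so this load cannot simply reuse `before`
    return before + after;
}
```

Calling it with both pointers aimed at the same int (value 1) yields 1 + 2 = 3, whereas two distinct ints holding 1 yield 2, so the compiler has to reload *p unless it can prove the pointers don't alias.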

No. Using const like that will not provide the compiler with any information that can be used for optimization.
Theoretically, for it to be a help to your compiler, the optimizer must be able to prove that nobody will ever use const_cast on your const pointer yet be unable to otherwise prove that the variable is never written to. Such a situation is highly unlikely.
Herb Sutter covers this in more depth in one of his Guru of the Week columns.

It can help or it can make no difference or it can make it worse. The only way to know is to try both and inspect the emitted machine code.
Modern compilers are very smart, so they can often deduce that memory is unchanged without any qualifiers (or they can deduce that many other optimizations are possible without the code being written in a manner that is easier to analyze), yet they are rather complex and so have a lot of deficiencies and often can't optimize every possible thing at every opportunity.

I think the compiler can't do much in your scenario. The fact that your pointer declared as const int * const p doesn't guarantee that the memory can't be changed externally, e.g. by another thread. Therefore the compiler must generate code that reads the memory value on each iteration of your loop.
But if you are not going to write to the memory location and you know that no other piece of code will, then you can create a local variable and use it similar to this:
const int * p = ...
...
int val = *p;
/* use the value in a loop */
for (int i = 0; i < BAZILLION; i++)
{
    use_value(val);
}
Not only do you help potential readers of your code to see that val is not changed in the loop, but you also give the compiler a chance to optimize (by keeping val in a register, for instance).

Using const is, as everyone else has said, unlikely to help the compiler optimize your loop.
It may, however, help optimise code outside the loop, or at the site of a call to a const-qualified method, or to a function taking const arguments.
This is likely to depend on whether the compiler can prove it's allowed to eliminate redundant loads, move them around, or cache calculated values rather than re-calculating them.
The only way to prove this is still to profile and/or check the assembly, but that's where you should probably be looking.

You don't say which compiler you are using. But if you are both reading and writing to memory you could benefit from using "restrict" or similar. The compiler does not know if your pointers are aliasing the same memory so any store often forces loading other values again. "restrict" tells the compiler that no aliasing of the pointer is happening and can keep using values loaded before a subsequent write. Another way to avoid the aliasing issue is to load your values into local variables then the compiler is not forced to reload after a write.
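For what it's worth, here is a small sketch of the "restrict" idea in C++. Standard C++ has no restrict keyword; __restrict is a compiler extension accepted by GCC, Clang and MSVC, and the function add_scaled is made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// __restrict is a compiler extension, not standard C++. It promises the
// compiler that out, in and scale never refer to overlapping memory.
void add_scaled(float* __restrict out, const float* __restrict in,
                const float* __restrict scale, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // Without the promise, the store to out[i] could change *scale,
        // forcing a reload on every iteration.
        out[i] += in[i] * *scale;
    }
}
```

Breaking the promise (passing overlapping arrays) is undefined behavior, so this is only appropriate when the caller can guarantee there is no aliasing.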

Is std::string thread-safe with gcc 4.3?

I'm developing a multithreaded program running on Linux (compiled with G++ 4.3) and if you search around for a bit you find a lot of scary stories about std::string not being thread-safe with GCC. This is supposedly due to the fact that internally it uses copy-on-write which wreaks havoc with tools like Helgrind.
I've made a small program that copies one string to another string and if you inspect both strings they both share the same internal _M_p pointer. When one string is modified the pointer changes so the copy-on-write stuff is working fine.
What I'm worried about though is what happens if I share a string between two threads (for instance passing it as an object in a threadsafe dataqueue between two threads). I've already tried compiling with the '-pthread' option but that does not seem to make much difference. So my question:
Is there any way to force std::string to be threadsafe? I would not mind if the copy-on-write behaviour was disabled to achieve this.
How have other people solved this? Or am I being paranoid?
I can't seem to find a definitive answer so I hope you guys can help me..
Edit:
Wow, that's a whole lot of answers in such a short time. Thank you! I will definitely use Jack's solution when I want to disable COW. But now the main question becomes: do I really have to disable COW? Or is the 'bookkeeping' done for COW thread safe? I'm currently browsing the libstdc++ sources but that's going to take quite some time to figure out...
Edit 2
OK, I browsed the libstdc++ source code and found something like this in libstdc++-v3/include/bits/basic_string.h:
_CharT*
_M_refcopy() throw()
{
#ifndef _GLIBCXX_FULLY_DYNAMIC_STRING
    if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
        __gnu_cxx::__atomic_add_dispatch(&this->_M_refcount, 1);
    return _M_refdata();
} // XXX MT
So there is definitely something there about atomic changes to the reference counter...
Conclusion
I'm marking sellibitze's comment as answer here because I think we've reached the conclusion that this area is still unresolved for now. To circumvent the COW behaviour I'd suggest Jack Lloyd's answer. Thank you everybody for an interesting discussion!

Threads are not yet part of the standard. But I don't think that any vendor can get away without making std::string thread-safe, nowadays. Note: There are different definitions of "thread-safe" and mine might differ from yours. Of course, it makes little sense to protect a container like std::vector for concurrent access by default even when you don't need it. That would go against the "don't pay for things you don't use" spirit of C++. The user should always be responsible for synchronization if he/she wants to share objects among different threads. The issue here is whether a library component uses and shares some hidden data structures that might lead to data races even if "functions are applied on different objects" from a user's perspective.
The C++0x draft (N2960) contains the section "data race avoidance", which basically says that library components may access shared data that is hidden from the user if and only if they actively avoid possible data races. It sounds like a copy-on-write implementation of std::basic_string must be as safe w.r.t. multi-threading as another implementation where internal data is never shared among different string instances.
I'm not 100% sure about whether libstdc++ takes care of it already. I think it does. To be sure, check out the documentation

If you don't mind disabling copy-on-write, this may be the best course of action. std::string's COW only works if it knows that it is copying another std::string, so you can cause it to always allocate a new block of memory and make an actual copy. For instance this code:
#include <string>
#include <cstdio>
int main()
{
    std::string orig = "I'm the original!";
    std::string copy_cow = orig;
    std::string copy_mem = orig.c_str();
    std::printf("%p %p %p\n", orig.data(),
                copy_cow.data(),
                copy_mem.data());
}
will show that the second copy (using c_str) prevents COW. (Because the std::string only sees a bare const char*, and has no idea where it came from or what its lifetime might be, so it has to make a new private copy).

This section of the libstdc++ internals states:
The C++ library string functionality
requires a couple of atomic operations
to provide thread-safety. If you don't
take any special action, the library
will use stub versions of these
functions that are not thread-safe.
They will work fine, unless your
applications are multi-threaded.
The reference counting should work in a multi-threaded environment. (unless your system doesn't provide the necessary atomics)

No STL container is thread safe. This way, the library has a general purpose (both to be used in single threading mode, or multi threading mode). In multithreading, you'll need to add the synchronization mechanism.

It seems that this was fixed a while ago: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=5444 was closed as the same issue as http://gcc.gnu.org/bugzilla/show_bug.cgi?id=5432, which was fixed in 3.1.
See also http://gcc.gnu.org/bugzilla/show_bug.cgi?id=6227

According to this bug issue, std::basic_string's copy-on-write implementation still isn't fully thread-safe. <ext/vstring.h> is an implementation without COW and seems to do much better in a read-only context.

Thread safe lazy construction of a singleton in C++

Is there a way to implement a singleton object in C++ that is:
Lazily constructed in a thread safe manner (two threads might simultaneously be the first user of the singleton - it should still only be constructed once).
Doesn't rely on static variables being constructed beforehand (so the singleton object is itself safe to use during the construction of static variables).
(I don't know my C++ well enough, but is it the case that integral and constant static variables are initialized before any code is executed (ie, even before static constructors are executed - their values may already be "initialized" in the program image)? If so - perhaps this can be exploited to implement a singleton mutex - which can in turn be used to guard the creation of the real singleton..)
Excellent, it seems that I have a couple of good answers now (shame I can't mark 2 or 3 as being the answer). There appear to be two broad solutions:
Use static initialisation (as opposed to dynamic initialisation) of a POD static variable, and implementing my own mutex with that using the builtin atomic instructions. This was the type of solution I was hinting at in my question, and I believe I knew already.
Use some other library function like pthread_once or boost::call_once. These I certainly didn't know about - and am very grateful for the answers posted.

Basically, you're asking for synchronized creation of a singleton, without using any synchronization (previously-constructed variables). In general, no, this is not possible. You need something available for synchronization.
As for your other question, yes, static variables which can be statically initialized (i.e. no runtime code necessary) are guaranteed to be initialized before other code is executed. This makes it possible to use a statically-initialized mutex to synchronize creation of the singleton.
From the 2003 revision of the C++ standard:
Objects with static storage duration (3.7.1) shall be zero-initialized (8.5) before any other initialization takes place. Zero-initialization and initialization with a constant expression are collectively called static initialization; all other initialization is dynamic initialization. Objects of POD types (3.9) with static storage duration initialized with constant expressions (5.19) shall be initialized before any dynamic initialization takes place. Objects with static storage duration defined in namespace scope in the same translation unit and dynamically initialized shall be initialized in the order in which their definition appears in the translation unit.
If you know that you will be using this singleton during the initialization of other static objects, I think you'll find that synchronization is a non-issue. To the best of my knowledge, all major compilers initialize static objects in a single thread, so thread-safety during static initialization is not a concern. You can declare your singleton pointer to be NULL, and then check to see if it's been initialized before you use it.
However, this assumes that you know that you'll use this singleton during static initialization. This is also not guaranteed by the standard, so if you want to be completely safe, use a statically-initialized mutex.
Edit: Chris's suggestion to use an atomic compare-and-swap would certainly work. If portability is not an issue (and creating additional temporary singletons is not a problem), then it is a slightly lower overhead solution.
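A minimal sketch of the statically-initialized mutex approach, assuming a POSIX system. The Service type and the g_service names are invented for the example.

```cpp
#include <cassert>
#include <pthread.h>

struct Service {
    int id;
    Service() : id(1) {}
};  // hypothetical singleton type

// Both objects are initialized with constant expressions, so they are
// statically initialized before any dynamic initialization runs.
static pthread_mutex_t g_service_lock = PTHREAD_MUTEX_INITIALIZER;
static Service* g_service = 0;

Service* GetService() {
    pthread_mutex_lock(&g_service_lock);
    if (g_service == 0) {
        g_service = new Service();  // constructed at most once
    }
    Service* result = g_service;
    pthread_mutex_unlock(&g_service_lock);
    return result;
}
```

Every call pays for a lock/unlock pair, which is the overhead the compare-and-swap suggestion tries to avoid; the instance is intentionally never deleted here.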

Unfortunately, Matt's answer features what's called double-checked locking which isn't supported by the C/C++ memory model. (It is supported by the Java 1.5 and later — and I think .NET — memory model.) This means that between the time when the pObj == NULL check takes place and when the lock (mutex) is acquired, pObj may have already been assigned on another thread. Thread switching happens whenever the OS wants it to, not between "lines" of a program (which have no meaning post-compilation in most languages).
Furthermore, as Matt acknowledges, he uses an int as a lock rather than an OS primitive. Don't do that. Proper locks require the use of memory barrier instructions, potentially cache-line flushes, and so on; use your operating system's primitives for locking. This is especially important because the primitives used can change between the individual CPU lines that your operating system runs on; what works on a CPU Foo might not work on CPU Foo2. Most operating systems either natively support POSIX threads (pthreads) or offer them as a wrapper for the OS threading package, so it's often best to illustrate examples using them.
If your operating system offers appropriate primitives, and if you absolutely need it for performance, instead of doing this type of locking/initialization you can use an atomic compare and swap operation to initialize a shared global variable. Essentially, what you write will look like this:
MySingleton *MySingleton::GetSingleton() {
    if (pObj == NULL) {
        // create a temporary instance of the singleton
        MySingleton *temp = new MySingleton();
        if (OSAtomicCompareAndSwapPtrBarrier(NULL, temp, &pObj) == false) {
            // if the swap didn't take place, delete the temporary instance
            delete temp;
        }
    }
    return pObj;
}
This only works if it's safe to create multiple instances of your singleton (one per thread that happens to invoke GetSingleton() simultaneously) and then throw the extras away. The OSAtomicCompareAndSwapPtrBarrier function provided on Mac OS X (most operating systems provide a similar primitive) checks whether pObj is NULL and sets it to temp only if it is. This uses hardware support to perform the swap exactly once and report whether it happened.
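On a C++11 compiler the same idea can be written portably with std::atomic instead of an OS-specific call. This is a sketch under that assumption, not the code the answer was written against; the class layout is made up for the example.

```cpp
#include <atomic>
#include <cassert>

class MySingleton {
public:
    static MySingleton* GetSingleton() {
        MySingleton* current = pObj.load(std::memory_order_acquire);
        if (current == nullptr) {
            // create a temporary instance of the singleton
            MySingleton* temp = new MySingleton();
            // On success pObj becomes temp; on failure `current` receives
            // the pointer another thread installed first.
            if (pObj.compare_exchange_strong(current, temp,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
                current = temp;
            } else {
                delete temp;  // lost the race: throw the extra away
            }
        }
        return current;
    }
private:
    MySingleton() {}
    static std::atomic<MySingleton*> pObj;
};

std::atomic<MySingleton*> MySingleton::pObj{nullptr};
```

As with the original, several threads may briefly construct their own instance, so the constructor must be safe to run more than once.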
Another facility to leverage if your OS offers it that's in between these two extremes is pthread_once. This lets you set up a function that's run only once - basically by doing all of the locking/barrier/etc. trickery for you - no matter how many times it's invoked or on how many threads it's invoked.
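A short sketch of the pthread_once variant, assuming a POSIX system. Config, GetConfig and the g_ names are hypothetical, and the call counter exists only to make the "runs once" property visible.

```cpp
#include <cassert>
#include <pthread.h>

struct Config {
    int value;
    Config() : value(42) {}
};  // hypothetical singleton type

static pthread_once_t g_config_once = PTHREAD_ONCE_INIT;
static Config* g_config = 0;
static int g_init_calls = 0;  // demo only: counts how often init ran

// pthread_once expects a plain function taking no arguments.
extern "C" void InitConfigOnce() {
    g_config = new Config();
    ++g_init_calls;
}

Config* GetConfig() {
    // Runs InitConfigOnce exactly once, no matter how many threads call
    // GetConfig(); later callers wait until initialization has finished.
    pthread_once(&g_config_once, InitConfigOnce);
    return g_config;
}
```

Unlike the compare-and-swap version, no extra instances are ever constructed.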

Here's a very simple lazily constructed singleton getter:
Singleton *Singleton::self() {
    static Singleton instance;
    return &instance;
}
This is lazy, and the next C++ standard (C++0x) requires it to be thread safe. In fact, I believe that at least g++ implements this in a thread safe manner. So if that's your target compiler or if you use a compiler which also implements this in a thread safe manner (maybe newer Visual Studio compilers do? I don't know), then this might be all you need.
Also see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2513.html on this topic.

You can't do it without any static variables, however if you are willing to tolerate one, you can use Boost.Thread for this purpose. Read the "one-time initialisation" section for more info.
Then in your singleton accessor function, use boost::call_once to construct the object, and return it.
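To give an idea of the shape of such an accessor without pulling in Boost, here is a sketch that uses std::call_once from C++11 as a stand-in for boost::call_once (the two are closely related, but this is not the Boost API). The Registry class and its members are invented for the example.

```cpp
#include <cassert>
#include <mutex>

class Registry {
public:
    static Registry& Instance() {
        // The lambda runs exactly once, even with concurrent callers.
        std::call_once(init_flag_, [] { instance_ = new Registry(); });
        return *instance_;
    }
    static int constructions() { return constructions_; }
private:
    Registry() { ++constructions_; }
    static std::once_flag init_flag_;  // the one static this needs
    static Registry* instance_;
    static int constructions_;         // demo only
};

std::once_flag Registry::init_flag_;
Registry* Registry::instance_ = nullptr;
int Registry::constructions_ = 0;
```

The once_flag is constant-initialized, so the accessor is usable during the dynamic initialization of other statics; the instance is deliberately leaked to sidestep destruction-order problems.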

For gcc, this is rather easy:
const LazyType* GetMyLazyGlobal() {
    static const LazyType* instance = new LazyType();
    return instance;
}
GCC will make sure that the initialization is atomic. For VC++, this is not the case. :-(
One major issue with this mechanism is the lack of testability: if you need to reset the LazyType to a new one between tests, or want to change the LazyType* to a MockLazyType*, you won't be able to. Given this, it's usually best to use a static mutex + static pointer.
Also, possibly an aside: It's best to always avoid static non-POD types. (Pointers to PODs are OK.) The reasons for this are many: as you mention, initialization order isn't defined -- neither is the order in which destructors are called though. Because of this, programs will end up crashing when they try to exit; often not a big deal, but sometimes a showstopper when the profiler you are trying to use requires a clean exit.

While this question has already been answered, I think there are some other points to mention:
If you want lazy-instantiation of the singleton while using a pointer to a dynamically allocated instance, you'll have to make sure you clean it up at the right point.
You could use Matt's solution, but you'd need to use a proper mutex/critical section for locking, and by checking "pObj == NULL" both before and after the lock. Of course, pObj would also have to be static ;) A mutex would be unnecessarily heavy in this case, you'd be better going with a critical section.
But as already stated, you can't guarantee threadsafe lazy-initialisation without using at least one synchronisation primitive.
Edit: Yup Derek, you're right. My bad. :)

You could use Matt's solution, but you'd need to use a proper mutex/critical section for locking, and by checking "pObj == NULL" both before and after the lock. Of course, pObj would also have to be static ;) . A mutex would be unnecessarily heavy in this case, you'd be better going with a critical section.
OJ, that doesn't work. As Chris pointed out, that's double-check locking, which is not guaranteed to work in the current C++ standard. See: C++ and the Perils of Double-Checked Locking
Edit: No problem, OJ. It's really nice in languages where it does work. I expect it will work in C++0x (though I'm not certain), because it's such a convenient idiom.
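For reference, with the C++11 memory model the idiom can be expressed correctly by making the pointer atomic. A sketch under that assumption (member names are made up for the example):

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class Singleton {
public:
    static Singleton* Get() {
        // First check, without the lock. The acquire load pairs with the
        // release store below, so a non-null pointer implies a fully
        // constructed object.
        Singleton* p = instance_.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(mutex_);
            // Second check, now under the lock.
            p = instance_.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Singleton();
                instance_.store(p, std::memory_order_release);
            }
        }
        return p;
    }
private:
    Singleton() {}
    static std::atomic<Singleton*> instance_;
    static std::mutex mutex_;
};

std::atomic<Singleton*> Singleton::instance_{nullptr};
std::mutex Singleton::mutex_;
```

The broken pre-C++11 versions differ mainly in using a plain pointer, which lets the pointer store become visible before the object's construction does.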

Read up on weak memory models. They can break double-checked locks and spinlocks. Intel has a strong memory model (for now), so it's easier on Intel.
Carefully use "volatile" to avoid caching parts of the object in registers; otherwise you'll have initialized the object pointer but not the object itself, and the other thread will crash.
The order of static variable initialization versus shared code loading is sometimes not trivial. I've seen cases where the code to destruct an object was already unloaded, so the program crashed on exit.
Such objects are hard to destroy properly.
In general singletons are hard to do right and hard to debug. It's better to avoid them altogether.

I suppose saying don't do this because it's not safe and will probably break more often than just initializing this stuff in main() isn't going to be that popular.
(And yes, I know that suggesting that means you shouldn't attempt to do interesting stuff in constructors of global objects. That's the point.)
