Two Different Processes With 2 std::atomic Variables on Same Address?

Two Different Processes With 2 std::atomic Variables on Same Address? - c++

I read C++ Standard (n4713)'s § 32.6.1 3:
Operations that are lock-free should also be address-free. That is,
atomic operations on the same memory location via two different
addresses will communicate atomically. The implementation should not
depend on any per-process state. This restriction enables
communication by memory that is mapped into a process more than once
and by memory that is shared between two processes.
So it sounds like it is possible to perform a lock-free atomic operation on the same memory location. I wonder how it can be done.
Let's say I have a named shared memory segment on Linux (via shm_open() and mmap()). How can I perform a lockfree operation on the first 4 bytes of the shared memory segment for example?
At first, I thought I could just reinterpret_cast the pointer to std::atomic<int32_t>*. But then I read this. It first points out that std::atomic might not have the same size of T or alignment:
When we designed the C++11 atomics, I was under the misimpression that
it would be possible to semi-portably apply atomic operations to data
not declared to be atomic, using code such as
int x; reinterpret_cast<atomic<int>&>(x).fetch_add(1);
This would clearly fail if the representations of atomic and int
differ, or if their alignments differ. But I know that this is not an
issue on platforms I care about. And, in practice, I can easily test
for a problem by checking at compile time that sizes and alignments
match.
Tho, it is fine with me in this case because I use a shared memory on the same machine and casting the pointer in two different processes will "acquire" the same location. However, the article states that the compiler might not treat the casted pointer as a pointer to an atomic type:
However this is not guaranteed to be reliable, even on platforms on
which one might expect it to work, since it may confuse type-based
alias analysis in the compiler. A compiler may assume that an int is
not also accessed as an atomic<int>. (See 3.10, [Basic.lval], last
paragraph.)
Any input is welcome!

The C++ standard doesn't concern itself with multiple processes and no guarantees were given outside of a multi-threaded environment.
However, the standard does recommend that implementations of lock-free atomics be usable across processes, which is the case in most real implementations.
This answer will assume atomics behave more or less the same with processes as with threads.
The first solution requires C++20 atomic_ref
void* shared_mem = /* something */
auto p1 = new (shared_mem) int; // For creating the shared object
auto p2 = (int*)shared_mem; // For getting the shared object
std::atomic_ref<int> i{p2}; // Use i as if atomic<int>
You need to make sure the shared int has std::atomic_ref<int>::required_alignment alignment; typically the same as sizeof(int). Normally you'd use alignas() on a struct member or variable, but in shared memory the layout is up to you (relative to a known page boundary).
This prevents the presence of opaque atomic types existing in the shared memory, which gives you precise control over what exactly goes in there.
A solution prior C++20 would be
auto p1 = new (shared_mem) atomic<int>; // For creating the shared object
auto p2 = (atomic<int>*)shared_mem; // For getting the shared object
auto& i = *p2;
Or using C11 atomic_load and atomic_store
_Atomic int* i = (_Atomic int*)shared_mem;
atomic_store(i, 42);
int i2 = atomic_load(i);
Alignment requirements are the same here, alignof(std::atomic<int>) or _Alignof(atomic_int).

Yes, the C++ standard is a bit mealy-mouthed about all this.
If you are on Windows (which you probably aren't) then you can use InterlockedExchange() etc, which offer all the required semantics and don't care where the referenced object is (it's a LONG *).
On other platforms, gcc has some atomic builtins which might help with this. They might free you from the tyranny of the standards writers. Trouble is, it's hard to test if the resulting code is bullet-proof.

On all mainstream platforms, std::atomic<T> does have the same size as T, although possibly higher alignment requirement if T has alignof < sizeof.
You can check these assumptions with:
static_assert(sizeof(T) == sizeof(std::atomic<T>),
"atomic<T> isn't the same size as T");
static_assert(std::atomic<T>::is_always_lock_free, // C++17
"atomic<T> isn't lock-free, unusable on shared mem");
auto atomic_ptr = static_cast<atomic<int>*>(some_ptr);
// beware strict-aliasing violations
// don't also access the same memory via int*
// unless you're aware of possible issues
// also make sure that the ptr is aligned to alignof(atomic<T>)
// otherwise you might get tearing (non-atomicity)
On exotic C++ implementations where these aren't true, people that want to use your code on shared memory will need to do something else.
Or if all accesses to shared memory from all processes consistently use atomic<T> then there's no problem, you only need lock-free to guarantee address-free. (You do need to check this: std::atomic uses a hash table of locks for non-lock-free. This is address-dependent, and separate processes will have separate hash tables of locks.)

Related

What is C++20 std::atomic<shared_ptr<T>> and std::atomic<weak_ptr<T>>? [duplicate]

I read the following article by Antony Williams and as I understood in addition to the atomic shared count in std::shared_ptr in std::experimental::atomic_shared_ptr the actual pointer to the shared object is also atomic?
But when I read about reference counted version of lock_free_stack described in Antony's book about C++ Concurrency it seems for me that the same aplies also for std::shared_ptr, because functions like std::atomic_load, std::atomic_compare_exchnage_weak are applied to the instances of std::shared_ptr.
template <class T>
class lock_free_stack
{
public:
void push(const T& data)
{
const std::shared_ptr<node> new_node = std::make_shared<node>(data);
new_node->next = std::atomic_load(&head_);
while (!std::atomic_compare_exchange_weak(&head_, &new_node->next, new_node));
}
std::shared_ptr<T> pop()
{
std::shared_ptr<node> old_head = std::atomic_load(&head_);
while(old_head &&
!std::atomic_compare_exchange_weak(&head_, &old_head, old_head->next));
return old_head ? old_head->data : std::shared_ptr<T>();
}
private:
struct node
{
std::shared_ptr<T> data;
std::shared_ptr<node> next;
node(const T& data_) : data(std::make_shared<T>(data_)) {}
};
private:
std::shared_ptr<node> head_;
};
What is the exact difference between this two types of smart pointers, and if pointer in std::shared_ptr instance is not atomic, why it is possible the above lock free stack implementation?

The atomic "thing" in shared_ptr is not the shared pointer itself, but the control block it points to. meaning that as long as you don't mutate the shared_ptr across multiple threads, you are ok. do note that copying a shared_ptr only mutates the control block, and not the shared_ptr itself.
std::shared_ptr<int> ptr = std::make_shared<int>(4);
for (auto i =0;i<10;i++){
std::thread([ptr]{ auto copy = ptr; }).detach(); //ok, only mutates the control block
}
Mutating the shared pointer itself, such as assigning it different values from multiple threads, is a data race, for example:
std::shared_ptr<int> ptr = std::make_shared<int>(4);
std::thread threadA([&ptr]{
ptr = std::make_shared<int>(10);
});
std::thread threadB([&ptr]{
ptr = std::make_shared<int>(20);
});
Here, we are mutating the control block (which is ok) but also the shared pointer itself, by making it point to a different values from multiple threads. This is not ok.
A solution to that problem is to wrap the shared_ptr with a lock, but this solution is not so scalable under some contention, and in a sense, loses the automatic feeling of the standard shared pointer.
Another solution is to use the standard functions you quoted, such as std::atomic_compare_exchange_weak. This makes the work of synchronizing shared pointers a manual one, which we don't like.
This is where atomic shared pointer comes to play. You can mutate the shared pointer from multiple threads without fearing a data race and without using any locks. The standalone functions will be members ones, and their use will be much more natural for the user. This kind of pointer is extremely useful for lock-free data structures.

N4162(pdf), the proposal for atomic smart pointers, has a good explanation. Here's a quote of the relevant part:
Consistency. As far as I know, the [util.smartptr.shared.atomic]
functions are the only atomic operations in the standard that
are not available via an atomic type. And for all types
besides shared_ptr, we teach programmers to use atomic types
in C++, not atomic_* C-style functions. And that’s in part because of...
Correctness. Using the free functions makes code error-prone
and racy by default. It is far superior to write atomic once on
the variable declaration itself and know all accesses
will be atomic, instead of having to remember to use the atomic_*
operation on every use of the object, even apparently-plain reads.
The latter style is error-prone; for example, “doing it wrong” means
simply writing whitespace (e.g., head instead of atomic_load(&head) ),
so that in this style every use of the variable is “wrong by default.” If you forget to
write the atomic_* call in even one place, your code will still
successfully compile without any errors or warnings, it will “appear
to work” including likely pass most testing, but will still contain a
silent race with undefined behavior that usually surfaces as intermittent
hard-to-reproduce failures, often/usually in the field,
and I expect also in some cases exploitable vulnerabilities.
These classes of errors are eliminated by simply declaring the variable atomic,
because then it’s safe by default and to write the same set of
bugs requires explicit non-whitespace code (sometimes explicit
memory_order_* arguments, and usually reinterpret_casting).
Performance. atomic_shared_ptr<> as a distinct type
has an important efficiency advantage over the
functions in [util.smartptr.shared.atomic] — it can simply store an
additional atomic_flag (or similar) for the internal spinlock
as usual for atomic<bigstruct>. In contrast, the existing standalone functions
are required to be usable on any arbitrary shared_ptr
object, even though the vast majority of shared_ptrs will
never be used atomically. This makes the free functions inherently
less efficient; for example, the implementation could require
every shared_ptr to carry the overhead of an internal spinlock
variable (better concurrency, but significant overhead per
shared_ptr), or else the library must maintain a lookaside data
structure to store the extra information for shared_ptrs that are
actually used atomically, or (worst and apparently common in
practice) the library must use a global spinlock.

Calling std::atomic_load() or std::atomic_compare_exchange_weak() on a shared_ptr is functionally equivalent to calling atomic_shared_ptr::load() or atomic_shared_ptr::atomic_compare_exchange_weak(). There shouldn't be any performance difference between the two. Calling std::atomic_load() or std::atomic_compare_exchange_weak() on a atomic_shared_ptr would be syntactically redundant and might or might not incur a performance penalty.

atomic_shared_ptr is an API refinement. shared_ptr already supports atomic operations, but only when using the appropriate atomic non-member functions. This is error-prone, because the non-atomic operations remain available and are too easy for an unwary programmer to invoke by accident. atomic_shared_ptr is less error-prone because it doesn't expose any non-atomic operations.
shared_ptr and atomic_shared_ptr expose different APIs, but they don't necessarily need to be implemented differently; shared_ptr already supports all the operations exposed by atomic_shared_ptr. Having said that, the atomic operations of shared_ptr are not as efficient as they could be, because it must also support non-atomic operations. Therefore there are performance reasons why atomic_shared_ptr could be implemented differently. This is related to the single responsibility principle. "An entity with several disparate purposes... often offers crippled interfaces for any of its specific purposes because the partial overlap among various areas of functionality blurs the vision needed for crisply implementing each." (Sutter & Alexandrescu 2005, C++ Coding Standards)

Could optimization break thread safety?

class Foo{
public:
void fetch(void)
{
int temp=-1;
someSlowFunction(&temp);
bar=temp;
}
int getBar(void)
{
return bar;
}
void someSlowFunction(int *ptr)
{
usleep(10000);
*ptr=0;
}
private:
int bar;
};
I'm new to atomic operations so I may get some concepts wrong.
Considering above code, assuming loading and storing int type are atomic[Note 1], then getBar() could only get the bar before or after a fetch().
However, if a compiler is smart enough, it could optimize away temp and change it to:
void Foo::fetch(void)
{
bar=-1;
someSlowFunction(&bar);
}
Then in this case getBar() could get -1 or other intermediate state inside someSlowFunction() under certain timing conditions.
Is this risk possible? Does the standard prevent such optimizations?
Note 1: http://preshing.com/20130618/atomic-vs-non-atomic-operations/
The language standards have nothing to say about atomicity in this
case. Maybe integer assignment is atomic, maybe it isn’t. Since
non-atomic operations don’t make any guarantees, plain integer
assignment in C is non-atomic by definition.
In practice, we usually know more about our target platforms than
that. For example, it’s common knowledge that on all modern x86, x64,
Itanium, SPARC, ARM and PowerPC processors, plain 32-bit integer
assignment is atomic as long as the target variable is naturally
aligned. You can verify it by consulting your processor manual and/or
compiler documentation. In the games industry, I can tell you that a
lot of 32-bit integer assignments rely on this particular guarantee.
I'm targeting ARM Cortex-A8 here, so I consider this a safe assumption.

Compiler optimization can not break thread safety!
You might however experience issues with optimizations in code that appeared to be thread safe but really only worked because of pure luck.
If you access data from multiple threads, you must either
Protect the appropriate sections using std::mutex or the like.
or, use std::atomic.
If not, the compiler might do optimizations that is next to impossible to expect.
I recommend watching CppCon 2014: Herb Sutter "Lock-Free Programming (or, Juggling Razor Blades), Part I" and Part II

After answering question in comments, it makes more sense. Let's analyze thread-safety here given that fetch() and getBar() are called from different threads. Several points will need to be considered:
'Dirty reads', or garabage reading due to interrupted write. While a general possibility, does not happen on 3 chip families I am familiar with for aligned ints. Let's discard this possibility for now, and just assume read values are alwats clean.
'Improper reads', or an option of reading something from bar which was never written there. Would it be possible? Optimizing away temp on the compiler part is, in my opinion, possible, but I am no expert in this matter. Let's assume it does not happen. The caveat would still be there - you might NEVER see the new value of bar. Not in a reasonable time, simply never.

The compiler can apply any transformation that results in the same observable behavior. Assignments to local non-volatile variables are not part of the observable behavior. The compiler may just decide to eliminate temp completely and just use bar directly. It may also decide that bar will always end up with the value zero, and set at the beginning of the function (at least in your simplified example).
However, as you can read in James' answer on a related question the situation is more complex because modern hardware also optimizes the executed code. This means that the CPU re-orders instructions, and neither the programmer or the compiler has influence on that without using special instructions. You need either a std::atomic, you memory fences explicitly (I wouldn't recommend it because it is quite tricky), or use a mutex which also acts as a memory fence.

It probably wouldn't optimize that way because of the function call in the middle, but you can define temp as volatile, this will tell the compiler not to perform these kinds of optimizations.
Depending on the platform, you can certainly have cases where multibyte quantities are in an inconsistent state. It doesn't even need to be thread related. For example, a device experiencing low voltage during a power brown-out can leave memory in an inconsistent state. If you have pointers getting corrupted, then it's usually bad news.
One way I approached this on a system without mutexes was to ensure every piece of data could be verified. For example, for every datum T, there would be a validation checksum C and a backup U.
A set operation would be as follows:
U = T
T = new value
C = checksum(T)
And a get operation would be as follows:
is checksum(T) == C
yes: return T
no: return U
This guarantees that the whatever is returned is in a consistent state. I would apply this algorithm to the entire OS, so for example, entire files could be restored.
If you want to ensure atomicity without getting into complex mutexes and whatever, try to use the smallest types possible. For example, does bar need to be an int or will unsigned char or bool suffice?

Thread-safe access to class members

Is accessing two different class members of the same object from two different POSIX threads at the same time considered to be thread-safe in C++ 03?

No. (with a little voice of "yes")
From the point of view of the C++03 standard, no such thing as threads exists, so there exist no conditions whatsoever under which the standard would consider anything involving concurrency as "safe".
While this is often no problem (with a little care and proper synchronization primitives that are outside the scope of C++, it will "work anyway"), there are a few things to be aware of, among these:
errno (and other structures) might not be thread-local. The -pthread command line option mostly addresses this.
Class members may alias each other through references, pointers, or unions, so mutating different members might indeed mutate the same member concurrently
Without memory model, the compiler is allowed to (and will!) reorder loads and stores, which means that for example the "obvious" way of communicating by first writing a piece of data, and then setting a "data is ready" flag may not work as expected.
Under Windows, there exist some not-immediately-obvious static-dynamic CRT issues in presence of threading when your program loads DLLs. Be sure all components do "the same thing" (whatever it is).
Also, some old versions of the CRT may leak a few hundred bytes of memory per thread (usually not an issue).
Immutable objects are inherently thread-safe, as is read-only access from several threads.

How can C++ compilers support C++11 atomic, but not support C++11 memory model

While looking at Clang and g++ C++11 implementation status I noticed something strange:
they support C++11 atomics, but they dont support C++11 memory model.
I was under impression that you must have C++11 memory model to use atomics.
So what exactly is the difference between support for atomics and memory model?
Does a lack of memory model support means that legal C++11 programs that use std::atomic<T> arent seq consistent?
references:
http://clang.llvm.org/cxx_status.html
http://gcc.gnu.org/gcc-4.7/cxx0x_status.html

One of the issues is the definition of "memory location", that allows (and forces the compiler to support) locking different structure members by different locks. There is a discussion about a RL problem caused by this.
Basically the issue is that having a struct defined like this:
struct x {
long a;
unsigned int b1;
unsigned int b2:1;
};
the compiler is free to implement writing to b2 by overwriting b1 too (and apparently, judging from the report, it does). Therefore, the two fields have to be locked as one. However, as a consequence of the C++11 memory model, this is forbidden (well, not really forbidden, but the compiler must ensure simultaneous updates to b1 and b2 do not interfere; it could do it by locking or CAS-ing each such update, well, life is difficult on some architectures). Quoting from the report:
I've raised the issue with our GCC guys and they said to me that: "C does
not provide such guarantee, nor can you reliably lock different
structure fields with different locks if they share naturally aligned
word-size memory regions. The C++11 memory model would guarantee this,
but that's not implemented nor do you build the kernel with a C++11
compiler."
Nice info can also be found in the wiki.

I guess the "Lack of memory model" in these cases just means that the optimizers were written before the C++11 memory model got published, and might perform now invalid optimizations. It's very difficult and time-consuming to validate optimizations against the memory model, so it's no big surprise that the clang/gcc teams haven't finished that yet.
Does a lack of memory model support means that legal C++11 programs that use std::atomic arent seq consistent?
Yes, that's a possibility. It's even worse: the compiler might introduce data races into (according to the C++11 standard) race-free programs, e.g. by introducing speculative writes.
For example, several C++ compilers used to perform this optimization:
for (p = q; p = p -> next; ++p) {
if (p -> data > 0) ++count;
}
Could get optimized into:
register int r1 = count;
for (p = q; p = p -> next; ++p) {
if (p -> data > 0) ++r1;
}
count = r1;
If all p->data are non-negative, the original source code did not write to count, but the optimized code does. This can introduce a data race in an otherwise race-free program, so the C++11 specification disallows such optimizations. Existing compilers now have to verify (and adjust if necessary) all optimizations.
See Concurrency memory model compiler consequences for details.

It's not so much that they don't support the memory model, but that they don't (yet) support the API in the Standard for interacting with the memory model. That API includes a number of mutexes.
However, both Clang and GCC have been as thread aware as possible without a formal standard for some time. You don't have to worry about optimizations moving things to the wrong side of atomic operations.

Thread safe lazy construction of a singleton in C++

Is there a way to implement a singleton object in C++ that is:
Lazily constructed in a thread safe manner (two threads might simultaneously be the first user of the singleton - it should still only be constructed once).
Doesn't rely on static variables being constructed beforehand (so the singleton object is itself safe to use during the construction of static variables).
(I don't know my C++ well enough, but is it the case that integral and constant static variables are initialized before any code is executed (ie, even before static constructors are executed - their values may already be "initialized" in the program image)? If so - perhaps this can be exploited to implement a singleton mutex - which can in turn be used to guard the creation of the real singleton..)
Excellent, it seems that I have a couple of good answers now (shame I can't mark 2 or 3 as being the answer). There appears to be two broad solutions:
Use static initialisation (as opposed to dynamic initialisation) of a POD static variable, and implementing my own mutex with that using the builtin atomic instructions. This was the type of solution I was hinting at in my question, and I believe I knew already.
Use some other library function like pthread_once or boost::call_once. These I certainly didn't know about - and am very grateful for the answers posted.

Basically, you're asking for synchronized creation of a singleton, without using any synchronization (previously-constructed variables). In general, no, this is not possible. You need something available for synchronization.
As for your other question, yes, static variables which can be statically initialized (i.e. no runtime code necessary) are guaranteed to be initialized before other code is executed. This makes it possible to use a statically-initialized mutex to synchronize creation of the singleton.
From the 2003 revision of the C++ standard:
Objects with static storage duration (3.7.1) shall be zero-initialized (8.5) before any other initialization takes place. Zero-initialization and initialization with a constant expression are collectively called static initialization; all other initialization is dynamic initialization. Objects of POD types (3.9) with static storage duration initialized with constant expressions (5.19) shall be initialized before any dynamic initialization takes place. Objects with static storage duration defined in namespace scope in the same translation unit and dynamically initialized shall be initialized in the order in which their definition appears in the translation unit.
If you know that you will be using this singleton during the initialization of other static objects, I think you'll find that synchronization is a non-issue. To the best of my knowledge, all major compilers initialize static objects in a single thread, so thread-safety during static initialization. You can declare your singleton pointer to be NULL, and then check to see if it's been initialized before you use it.
However, this assumes that you know that you'll use this singleton during static initialization. This is also not guaranteed by the standard, so if you want to be completely safe, use a statically-initialized mutex.
Edit: Chris's suggestion to use an atomic compare-and-swap would certainly work. If portability is not an issue (and creating additional temporary singletons is not a problem), then it is a slightly lower overhead solution.

Unfortunately, Matt's answer features what's called double-checked locking which isn't supported by the C/C++ memory model. (It is supported by the Java 1.5 and later — and I think .NET — memory model.) This means that between the time when the pObj == NULL check takes place and when the lock (mutex) is acquired, pObj may have already been assigned on another thread. Thread switching happens whenever the OS wants it to, not between "lines" of a program (which have no meaning post-compilation in most languages).
Furthermore, as Matt acknowledges, he uses an int as a lock rather than an OS primitive. Don't do that. Proper locks require the use of memory barrier instructions, potentially cache-line flushes, and so on; use your operating system's primitives for locking. This is especially important because the primitives used can change between the individual CPU lines that your operating system runs on; what works on a CPU Foo might not work on CPU Foo2. Most operating systems either natively support POSIX threads (pthreads) or offer them as a wrapper for the OS threading package, so it's often best to illustrate examples using them.
If your operating system offers appropriate primitives, and if you absolutely need it for performance, instead of doing this type of locking/initialization you can use an atomic compare and swap operation to initialize a shared global variable. Essentially, what you write will look like this:
MySingleton *MySingleton::GetSingleton() {
if (pObj == NULL) {
// create a temporary instance of the singleton
MySingleton *temp = new MySingleton();
if (OSAtomicCompareAndSwapPtrBarrier(NULL, temp, &pObj) == false) {
// if the swap didn't take place, delete the temporary instance
delete temp;
}
}
return pObj;
}
This only works if it's safe to create multiple instances of your singleton (one per thread that happens to invoke GetSingleton() simultaneously), and then throw extras away. The OSAtomicCompareAndSwapPtrBarrier function provided on Mac OS X — most operating systems provide a similar primitive — checks whether pObj is NULL and only actually sets it to temp to it if it is. This uses hardware support to really, literally only perform the swap once and tell whether it happened.
Another facility to leverage if your OS offers it that's in between these two extremes is pthread_once. This lets you set up a function that's run only once - basically by doing all of the locking/barrier/etc. trickery for you - no matter how many times it's invoked or on how many threads it's invoked.

Here's a very simple lazily constructed singleton getter:
Singleton *Singleton::self() {
static Singleton instance;
return &instance;
}
This is lazy, and the next C++ standard (C++0x) requires it to be thread safe. In fact, I believe that at least g++ implements this in a thread safe manner. So if that's your target compiler or if you use a compiler which also implements this in a thread safe manner (maybe newer Visual Studio compilers do? I don't know), then this might be all you need.
Also see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2513.html on this topic.

You can't do it without any static variables, however if you are willing to tolerate one, you can use Boost.Thread for this purpose. Read the "one-time initialisation" section for more info.
Then in your singleton accessor function, use boost::call_once to construct the object, and return it.

For gcc, this is rather easy:
LazyType* GetMyLazyGlobal() {
static const LazyType* instance = new LazyType();
return instance;
}
GCC will make sure that the initialization is atomic. For VC++, this is not the case. :-(
One major issue with this mechanism is the lack of testability: if you need to reset the LazyType to a new one between tests, or want to change the LazyType* to a MockLazyType*, you won't be able to. Given this, it's usually best to use a static mutex + static pointer.
Also, possibly an aside: It's best to always avoid static non-POD types. (Pointers to PODs are OK.) The reasons for this are many: as you mention, initialization order isn't defined -- neither is the order in which destructors are called though. Because of this, programs will end up crashing when they try to exit; often not a big deal, but sometimes a showstopper when the profiler you are trying to use requires a clean exit.

While this question has already been answered, I think there are some other points to mention:
If you want lazy-instantiation of the singleton while using a pointer to a dynamically allocated instance, you'll have to make sure you clean it up at the right point.
You could use Matt's solution, but you'd need to use a proper mutex/critical section for locking, and by checking "pObj == NULL" both before and after the lock. Of course, pObj would also have to be static ;)
.
A mutex would be unnecessarily heavy in this case, you'd be better going with a critical section.
But as already stated, you can't guarantee threadsafe lazy-initialisation without using at least one synchronisation primitive.
Edit: Yup Derek, you're right. My bad. :)

You could use Matt's solution, but you'd need to use a proper mutex/critical section for locking, and by checking "pObj == NULL" both before and after the lock. Of course, pObj would also have to be static ;) . A mutex would be unnecessarily heavy in this case, you'd be better going with a critical section.
OJ, that doesn't work. As Chris pointed out, that's double-check locking, which is not guaranteed to work in the current C++ standard. See: C++ and the Perils of Double-Checked Locking
Edit: No problem, OJ. It's really nice in languages where it does work. I expect it will work in C++0x (though I'm not certain), because it's such a convenient idiom.

read on weak memory model. It can break double-checked locks and spinlocks. Intel is strong memory model (yet), so on Intel it's easier
carefully use "volatile" to avoid caching of parts the object in registers, otherwise you'll have initialized the object pointer, but not the object itself, and the other thread will crash
the order of static variables initialization versus shared code loading is sometimes not trivial. I've seen cases when the code to destruct an object was already unloaded, so the program crashed on exit
such objects are hard to destroy properly
In general singletons are hard to do right and hard to debug. It's better to avoid them altogether.

I suppose saying don't do this because it's not safe and will probably break more often than just initializing this stuff in main() isn't going to be that popular.
(And yes, I know that suggesting that means you shouldn't attempt to do interesting stuff in constructors of global objects. That's the point.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js