Can threads be safely created during static initialization? - c++

At some point I remember reading that threads can't be safely created until the first line of main(), because compilers insert special code to make threading work that runs during static initialization time. So if you have a global object that creates a thread on construction, your program may crash. But now I can't find the original article, and I'm curious how strong a restriction this is -- is it strictly true by the standard? Is it true on most compilers? Will it remain true in C++0x? Is it possible for a standards conforming compiler to make static initialization itself multithreaded? (e.g. detecting that two global objects don't touch one another, and initializing them on separate threads to accelerate program startup)
Edit: To clarify, I'm trying to at least get a feel for whether implementations really differ significantly in this respect, or if it's something that's pseudo-standard. For example, technically the standard allows for shuffling the layout of members that belong to different access specifiers (public/protected/etc.). But no compiler I know of actually does this.

What you're talking about isn't strictly in the language but in the C Run Time Library (CRT).
For a start, if you create a thread using a native call such as CreateThread() on windows then you can do it anywhere you'd like because it goes straight to the OS with no intervention of the CRT.
The other option you usually have is to use _beginthread() which is part of the CRT. There are some advantages to using _beginthread() such as having a thread-safe errno. Read more about this here. If you're going to create threads using _beginthread() there could be some issues because the initializations needed for _beginthread() might not be in place.
This touches on a more general issue of what exactly happens before main() and in what order. Basically you have the program's entry point function which takes care of everything that needs to happen before main() with Visual Studio you can actually look at this piece of code which is in the CRT and find out for yourself what exactly's going on there. The easyest way to get to that code is to stop a breakpoint in your code and look at the stack frames before main()

The underlying problem is a Windows restriction on what you can and cannot do in DllMain. In particular, you're not supposed to create threads in DllMain. Static initialization often happens from DllMain. Then it follows logically that you can't create threads during static initialization.

As far as I can tell from reading the C++0x/1x draft, starting a thread prior to main() is fine, but still subject to the normal pitfalls of static initialization. A conforming implementation will have to make sure the code to intialize threading executes before any static or thread constructors do.

Related

Why is volatile keyword not needed for thread synchronisation?

I am reading that the volatile keyword is not suitable for thread synchronisation and in fact it is not needed for these purposes at all.
While I understand that using this keyword is not sufficient, I fail to understand why is it completely unnecessary.
For example, assume we have two threads, thread A that only reads from a shared variable and thread B that only writes to a shared variable. Proper synchronisation by e.g. pthreads mutexes is enforced.
IIUC, without the volatile keyword, the compiler may look at the code of thread A and say: “The variable doesn’t appear to be modified here, but we have lots of reads; let’s read it only once, cache the value and optimise away all subsequent reads.” Also it may look at the code of thread B and say: “We have lots of writes to this variable here, but no reads; so, the written values are not needed and thus let’s optimise away all writes.“
Both optimisations would be incorrect. And both one would be prevented by volatile. So, I would likely come to the conclusion that while volatile is not enough to synchronise threads, it is still necessary for any variable shared between threads. (note: I now read that actually it is not required for volatile to prevent write elisions; so I am out of ideas how to prevent such incorrect optimisations)
I understand that I am wrong in here. But why?
For example, assume we have two threads, thread A that only reads from a shared variable and thread B that only writes to a shared variable. Proper synchronisation by e.g. pthreads mutexes is enforced.
IIUC, without the volatile keyword, the compiler may look at the code of thread A and say: “The variable doesn’t appear to be modified here, but we have lots of reads; let’s read it only once, cache the value and optimise away all subsequent reads.” Also it may look at the code of thread B and say: “We have lots of writes to this variable here, but no reads; so, the written values are not needed and thus let’s optimise away all writes.“
Like most thread synchronization primitives, pthreads mutex operations have explicitly defined memory visibility semantics.
Either the platform supports pthreads or it doesn't. If it supports pthreads, it supports pthreads mutexes. Either those optimizations are safe or they aren't. If they're safe, there's no problem. If they're unsafe, then any platform that makes them doesn't support pthreads mutexes.
For example, you say "The variable doesn’t appear to be modified here", but it does -- another thread could modify it there. Unless the compiler can prove its optimization can't break any conforming program, it can't make it. And a conforming program can modify the variable in another thread. Either the compiler supports POSIX threads or it doesn't.
As it happens, most of this happens automatically on most platforms. The compiler is just prevented from having any idea what the mutex operations do internally. Anything another thread could do, the mutex operations themselves could do. So the compiler has to "synchronize" memory before entering and exiting those functions anyway. It can't, for example, keep a value in a register across the call to pthread_mutex_lock because for all it knows, pthread_mutex_lock accesses that value in memory. Alternatively, if the compiler has special knowledge about the mutex functions, that would include knowing about the invalidity of caching values accessible to other threads across those calls.
A platform that requires volatile would be pretty much unusable. You'd need versions of every function or class for the specific cases where an object might be made visible to, or was made visible from, another thread. In many cases, you'd pretty much just have to make everything volatile and not caching values in registers is a performance non-starter.
As you've probably heard many times, volatile's semantics as specified in the C language just do not mix usefully with threads. Not only is it not sufficient, it disables many perfectly safe and nearly essential optimizations.
Shortening the answer already given, you do not need to use volatile with mutexes for a simple reason:
If compiler knows what mutex operations are (by recognizing pthread_* functions or because you used std::mutex), it well knows how to handle access in regards to optimization (which is even required for std::mutex)
If compiler does not recognize them, pthread_* functions are completely opaque to it, and no optimizations involving any sort of non-local duration objects can go across opaque functions
Making the answer even shorter, not using either a mutex or a semaphore, is a bug. As soon as thread B releases the mutex (and thread A gets it), any value in the register which contains the shared variable's value from thread B are guaranteed to be written to cache or memory that will prevent a race condition when thread A runs and reads this variable.
The implementation to guarantee this is architecture/compiler dependent.
The keyword volatile tells the compiler to treat any write or read of the variable as an "observable side-effect." That is all it does. Observable side-effects of course must not be optimized away, and must appear to the outside world as occurring in the order the program indicates; The compiler may not re-order observable side-effects with regard to each other. The compiler is however free to reorder them with respect to non-observables. Therefore, volatile is only appropriate for accessing memory-mapped hardware, Unix-style signal handlers and the like. For inter-thread concurrence, use std::atomic or higher level synchronization objects like mutex, condition_variable, and promise/future.

Platform-Specific Workarounds to C++ Static Destruction / Construction Order Problems

I am developing with Visual Studio 2008 in standard (unmanaged) C++ under Windows XP Pro SP 3.
I have created a thread-safe wrapper around std::cout. This wrapper object is a drop-in replacement (i.e. same name) for what use to be a macro that was #defined to cout. It is used by a lot of code. Its behavior is probably pretty much as you would expect:
At construction, it creates a critical section.
During calls to operator<<(), it locks the critical section, passes the data to be printed to cout, and finally releases the critical section.
At destruction, it destroys the critical section.
This wrapper lives in static storage (it's global). Like all such objects, it constructs before main() starts and it destructs after main() exits.
Using my wrapper in the destructor of another object that also lives in static storage is problematic. Since the construction / destruction order of such objects is indeterminate, I may very well try to lock a critical section that has been destroyed. The symptom I'm seeing is that my program blocks on the lock attempt (though I suppose anything is liable to happen).
As far as ways to deal with this...
I could do nothing in the destructor; specifically, I would let the critical section continue to live. The C++ standard guarantees that cout never dies during program execution, and this would be the best possible attempt at making my wrapper behave similarly. Of course, my wrapper would "officially" be dead after its empty destructor runs, but it would probably (I hate that word) be as functional as it was before its destructor ran. On my platform, this does seem to be the case. But oh my gosh is this ugly, non-portable, and liable to future breakage...
I hold the critical section (but not the stream reference to cout) in a pimpl. All critical section accesses via the pimpl are preceded by a check for non-nullness of the pimpl. It so happens that I forgot to set the pimpl to 0 after calling delete on it in the destructor. If I were to set this to 0 (which I should do anyway), calls into my wrapper after it destructed would do nothing with the critical section but will still pass data to be printed to cout. On my platform, this also seems to work. Again, ugly...
I could tell my teammates to not use my wrapper after main() exits. Unfortunately, the aerodynamics of this would be about the same as that of a tank.
QUESTIONS:
* Question 1 *
For case 1, if I leave the critical section undestroyed, there will be a resource leak of a critical section in the OS. Will this leak persist after my program has fully exited? If not, case 1 becomes more viable.
* Question 2 *
For cases 1 and 2, does anybody know if on my particular platform I can indeed safely continue to use my wrapper after its empty destructor runs? It appears I can, but I want to see if anybody knows anything definitive about how my platform behaves in this case...
* Question 3 *
My proposals are obviously imperfect, but I do not see a truly correct solution. Does anybody out there know of a correct solution to this problem?
Side note: Of course, there is a converse problem that could occur if I try to use my wrapper in the constructor of another object that also lives in static storage. In that case, I may try to lock a critical section that has not yet been created. I would like to use the "construct on first use" idiom to fix this, but that entails a syntactic change of all the code that uses my wrapper. This would require giving up the naturalness of using the << operator. And there's way too much code to change anyway. So, this is not a workable option. I'm not very far into the thought process on this half of the problem, but I will ask one question that might be part of another imperfect way to deal with the problem...
* Question 4 *
As I've said, my wrapper lives in static storage (it's global) and it has a pimpl (hormonal problem :) ). I have the impression that the raw bytes of a variable in static storage are set to 0 at load time (unless initialized differently in code). This would mean that my wrapper's pimpl has a value of 0 before construction of my wrapper. Is this correct?
Thank You,
Dave
The first thing is that I would reconsider what you are doing altogether. You cannot create a thread safe interface by merely adding locking to each one of the operations. Thread safety must be designed into the interface. The problem with a drop in replacement as the one you propose is that it will make each single operation thread safe (I think they already are) but that does not avoid unwanted interleaving.
Consider two threads that executed cout << "Hi" << endl;, locking each operation does not exclude "HiHi\n\n" as output, and things get much ore complicated with manipulators, where one thread might change the format for the next value to be printed, but another thread might trigger the next write, in which case the two formats will be wrong.
On the particular question that you ask, You can consider using the same approach that the standard library takes with the iostreams:
Instead of creating the objects as globals, create a helper type that performs reference counting on the number of instances of the type. The constructor would check if the object is the first of its type to be created and initialize the thread safe wrapper. The last object to be destroyed would destroy your wrapper. The next piece of the puzzle is creating a global static variable of that type in a header that in turn includes the iostreams header. The last piece of the puzzle is that your users should include your header instead of iostreams.

Are mutex lock functions sufficient without volatile?

A coworker and I write software for a variety of platforms running on x86, x64, Itanium, PowerPC, and other 10 year old server CPUs.
We just had a discussion about whether mutex functions such as pthread_mutex_lock() ... pthread_mutex_unlock() are sufficient by themselves, or whether the protected variable needs to be volatile.
int foo::bar()
{
//...
//code which may or may not access _protected.
pthread_mutex_lock(m);
int ret = _protected;
pthread_mutex_unlock(m);
return ret;
}
My concern is caching. Could the compiler place a copy of _protected on the stack or in a register, and use that stale value in the assignment? If not, what prevents that from happening? Are variations of this pattern vulnerable?
I presume that the compiler doesn't actually understand that pthread_mutex_lock() is a special function, so are we just protected by sequence points?
Thanks greatly.
Update: Alright, I can see a trend with answers explaining why volatile is bad. I respect those answers, but articles on that subject are easy to find online. What I can't find online, and the reason I'm asking this question, is how I'm protected without volatile. If the above code is correct, how is it invulnerable to caching issues?
Simplest answer is volatile is not needed for multi-threading at all.
The long answer is that sequence points like critical sections are platform dependent as is whatever threading solution you're using so most of your thread safety is also platform dependent.
C++0x has a concept of threads and thread safety but the current standard does not and therefore volatile is sometimes misidentified as something to prevent reordering of operations and memory access for multi-threading programming when it was never intended and can't be reliably used that way.
The only thing volatile should be used for in C++ is to allow access to memory mapped devices, allow uses of variables between setjmp and longjmp, and to allow uses of sig_atomic_t variables in signal handlers. The keyword itself does not make a variable atomic.
Good news in C++0x we will have the STL construct std::atomic which can be used to guarantee atomic operations and thread safe constructs for variables. Until your compiler of choice supports it you may need to turn to the boost library or bust out some assembly code to create your own objects to provide atomic variables.
P.S. A lot of the confusion is caused by Java and .NET actually enforcing multi-threaded semantics with the keyword volatile C++ however follows suit with C where this is not the case.
Your threading library should include the apropriate CPU and compiler barriers on mutex lock and unlock. For GCC, a memory clobber on an asm statement acts as a compiler barrier.
Actually, there are two things that protect your code from (compiler) caching:
You are calling a non-pure external function (pthread_mutex_*()), which means that the compiler doesn't know that that function doesn't modify your global variables, so it has to reload them.
As I said, pthread_mutex_*() includes a compiler barrier, e.g: on glibc/x86 pthread_mutex_lock() ends up calling the macro lll_lock(), which has a memory clobber, forcing the compiler to reload variables.
If the above code is correct, how is it invulnerable to caching
issues?
Until C++0x, it is not. And it is not specified in C. So, it really depends on the compiler. In general, if the compiler does not guarantee that it will respect ordering constraints on memory accesses for functions or operations that involve multiple threads, you will not be able to write multithreaded safe code with that compiler. See Hans J Boehm's Threads Cannot be Implemented as a Library.
As for what abstractions your compiler should support for thread safe code, the wikipedia entry on Memory Barriers is a pretty good starting point.
(As for why people suggested volatile, some compilers treat volatile as a memory barrier for the compiler. It's definitely not standard.)
The volatile keyword is a hint to the compiler that the variable might change outside of program logic, such as a memory-mapped hardware register that could change as part of an interrupt service routine. This prevents the compiler from assuming a cached value is always correct and would normally force a memory read to retrieve the value. This usage pre-dates threading by a couple decades or so. I've seen it used with variables manipulated by signals as well, but I'm not sure that usage was correct.
Variables guarded by mutexes are guaranteed to be correct when read or written by different threads. The threading API is required to ensure that such views of variables are consistent. This access is all part of your program logic and the volatile keyword is irrelevant here.
With the exception of the simplest spin lock algorithm, mutex code is quite involved: a good optimized mutex lock/unlock code contains the kind of code even excellent programmer struggle to understand. It uses special compare and set instructions, manages not only the unlocked/locked state but also the wait queue, optionally uses system calls to go into a wait state (for lock) or wake up other threads (for unlock).
There is no way the average compiler can decode and "understand" all that complex code (again, with the exception of the simple spin lock) no matter way, so even for a compiler not aware of what a mutex is, and how it relates to synchronization, there is no way in practice a compiler could optimize anything around such code.
That's if the code was "inline", or available for analyse for the purpose of cross module optimization, or if global optimization is available.
I presume that the compiler doesn't actually understand that
pthread_mutex_lock() is a special function, so are we just protected
by sequence points?
The compiler does not know what it does, so does not try to optimize around it.
How is it "special"? It's opaque and treated as such. It is not special among opaque functions.
There is no semantic difference with an arbitrary opaque function that can access any other object.
My concern is caching. Could the compiler place a copy of _protected
on the stack or in a register, and use that stale value in the
assignment?
Yes, in code that act on objects transparently and directly, by using the variable name or pointers in a way that the compiler can follow. Not in code that might use arbitrary pointers to indirectly use variables.
So yes between calls to opaque functions. Not across.
And also for variables which can only be used in the function, by name: for local variables that don't have either their address taken or a reference bound to them (such that the compiler cannot follow all further uses). These can indeed be "cached" across arbitrary calls include lock/unlock.
If not, what prevents that from happening? Are variations of this
pattern vulnerable?
Opacity of the functions. Non inlining. Assembly code. System calls. Code complexity. Everything that make compilers bail out and think "that's complicated stuff just make calls to it".
The default position of a compiler is always the "let's execute stupidly I don't understand what is being done anyway" not "I will optimize that/let's rewrite the algorithm I know better". Most code is not optimized in complex non local way.
Now let's assume the absolute worse (from out point of view which is that the compiler should give up, that is the absolute best from the point of view of an optimizing algorithm):
the function is "inline" (= available for inlining) (or global optimization kicks in, or all functions are morally "inline");
no memory barrier is needed (as in a mono-processor time sharing system, and in a multi-processor strongly ordered system) in that synchronization primitive (lock or unlock) so it contains no such thing;
there is no special instruction (like compare and set) used (for example for a spin lock, the unlock operation is a simple write);
there is no system call to pause or wake threads (not needed in a spin lock);
then we might have a problem as the compiler could optimize around the function call. This is fixed trivially by inserting a compiler barrier such as an empty asm statement with a "clobber" for other accessible variables. That means that compiler just assumes that anything that might be accessible to a called function is "clobbered".
or whether the protected variable needs to be volatile.
You can make it volatile for the usual reason you make things volatile: to be certain to be able to access the variable in the debugger, to prevent a floating point variable from having the wrong datatype at runtime, etc.
Making it volatile would actually not even fix the issue described above as volatile is essentially a memory operation in the abstract machine that has the semantics of an I/O operation and as such is only ordered with respect to
real I/O like iostream
system calls
other volatile operations
asm memory clobbers (but then no memory side effect is reordered around those)
calls to external functions (as they might do one the above)
Volatile is not ordered with respect to non volatile memory side effects. That makes volatile practically useless (useless for practical uses) for writing thread safe code in even the most specific case where volatile would a priori help, the case where no memory fence is ever needed: when programming threading primitives on a time sharing system on a single CPU. (That may be one of the least understood aspects of either C or C++.)
So while volatile does prevent "caching", volatile doesn't even prevent compiler reordering of lock/unlock operation unless all shared variables are volatile.
Locks/synchronisation primitives make sure the data is not cached in registers/cpu cache, that means data propagates to memory. If two threads are accessing/ modifying data with in locks, it is guaranteed that data is read from memory and written to memory. We don't need volatile in this use case.
But the case where you have code with double checks, compiler can optimise the code and remove redundant code, to prevent that we need volatile.
Example: see singleton pattern example
https://en.m.wikipedia.org/wiki/Singleton_pattern#Lazy_initialization
Why do some one write this kind of code?
Ans: There is a performance benefit of not accuiring lock.
PS: This is my first post on stack overflow.
Not if the object you're locking is volatile, eg: if the value it represents depends on something foreign to the program (hardware state).
volatile should NOT be used to denote any kind of behavior that is the result of executing the program.
If it's actually volatile what I personally would do is locking the value of the pointer/address, instead of the underlying object.
eg:
volatile int i = 0;
// ... Later in a thread
// ... Code that may not access anything without a lock
std::uintptr_t ptr_to_lock = &i;
some_lock(ptr_to_lock);
// use i
release_some_lock(ptr_to_lock);
Please note that it only works if ALL the code ever using the object in a thread locks the same address. So be mindful of that when using threads with some variable that is part of an API.

Context of Main function in C or C++

Does the main function we define in C or C++ run in a process or thread.
If it runs in a thread, which process is responsible for spawning it
main() is the entry point for your program. C++ (current C++ anyway) doesn't know what a process or thread is. The word 'process' is not even in the index of the standard. What happens before and after main() is mostly implementation defined. So, the answer to your question is also implementation defined.
In general though most operating systems have the concept of process and thread and they have similar meanings (though in Linux, for example, a thread is actually a "light weight process"). You can generally assume that your program will be started in a new process and that main() will then be called by the original thread after the implementation defined initialization.
Since there's plenty of room for the implementation and/or you to start up a whole bunch of threads before main is called though you will probably generally want to consider main() to have been called during the execution of a thread. The best way to think about it though is probably in terms of the standard unless you really have to think about the implementation. The standard doesn't currently know what a process or thread is. C++0x will change that in some way but I'm not sure at this point what the new concepts will be or how they will relate to OS specific constructs.
My answer is specifically addressed at the C++ language part of the question. C is a different language and I haven't used it in a good 10 years so I forget how the globals initialization is specified.
It's a process that you spawn when you execute your program. The main function is called at the beginning of the program. It is all a part of the same program (i.e. one process).
When you ask your OS to start a new process, it initializes data structures for a process and for a single thread inside that process. The initial instruction pointer in that thread context is the process entry point, which is a function provided by your C runtime library. That library-provided entry point converts the environment table and command-line arguments into the format demanded by the C standard, and then calls your main function.
Your whole program is a single process unless it starts fork()ing things, and by default the process has one thread that does everything; main() starts on that thread

Thread-safe static variables without mutexing?

I remember reading that static variables declared inside methods is not thread-safe. (See What about the Meyer's singleton? as mentioned by Todd Gardner)
Dog* MyClass::BadMethod()
{
static Dog dog("Lassie");
return &dog;
}
My library generates C++ code for end-users to compile as part of their application. The code it generates needs to initialize static variables in a thread-safe cross-platform manner. I'd like to use boost::call_once to mutex the variable initialization but then end-users are exposed to the Boost dependency.
Is there a way for me to do this without forcing extra dependencies on end-users?
You are correct that static initialization like that isn't thread safe (here is an article discussing what the compiler will turn it into)
At the moment, there's no standard, thread safe, portable way to initialize static singletons. Double checked locking can be used, but you need potentially non-portable threading libraries (see a discussion here).
Here's a few options if thread safety is a must:
Don't be Lazy (loaded): Initialize during static initialization. It could be a problem if another static calls this function in it's constructor, since the order of static initialization is undefined(see here).
Use boost (as you said) or Loki
Roll your
own singleton on your supported platforms
(should probably be avoided unless
you are a threading expert)
Lock a mutex everytime you need access. This could be very slow.
Example for 1:
// in a cpp:
namespace {
Dog dog("Lassie");
}
Dog* MyClass::BadMethod()
{
return &dog;
}
Example for 4:
Dog* MyClass::BadMethod()
{
static scoped_ptr<Dog> pdog;
{
Lock l(Mutex);
if(!pdog.get())
pdog.reset(new Dog("Lassie"));
}
return pdog.get();
}
Not sure whether this is what you mean or not, but you can remove the boost dependency on POSIX systems by calling pthread_once instead. I guess you'd have to do something different on Windows, but avoiding that is exactly why boost has a thread library in the first place, and why people pay the price of depending on it.
Doing anything "thread-safely" is inherently bound up with your threads implementation. You have to depend on something, even if it's only the platform-dependent memory model. It is simply not possible in pure C++03 to assume anything at all about threads, which are outside the scope of the language.
One way you could do it that does not require a mutex for thread safety is to make the singleton a file static, rather than function static:
static Dog dog("Lassie");
Dog* MyClass::BadMethod()
{
return &dog;
}
The Dog instance will be initialised before the main thread runs. File static variables have a famous issue with the initialisation order, but as long as the Dog does not rely on any other statics defined in another translation unit, this should not be of concern.
The only way I know of to guarantee you won't have threading issues with non-protected resources like your "static Dog" is to make it a requirement that they're all instantiated before any threads are created.
This could be as simple as just documenting that they have to call a MyInit() function in the main thread before doing anything else. Then you construct MyInit() to instantiate and destroy one object of each type that contains one of those statics.
The only other alternative is to put another restriction on how they can use your generated code (use Boost, Win32 threads, etc). Either of those solutions are acceptable in my opinion - it's okay to generate rules that they must follow.
If they don't follow the rules as set out by your documentation, then all bets are off. The rule that they must call an initialization function or be dependent on Boost is not unreasonable to me.
AFAIK, the only time this has been done safely and without mutexes or prior initialisation of global instances is in Matthew Wilson's Imperfect C++, which discusses how to do this using a "spin mutex". I'm not near to my copy of it, so can't tell you any more precisely at this time.
IIRC, there are some examples of the use of this inside the STLSoft libraries, though I can't remember which components at this time.