Everything is volatile - c++

I'm creating this multithreaded C++ program and upon compiling in Release mode, I'm finding bugs of the sort (object still null) ie, it looks like missing volatile markers.
But the problem is, since there is a 2nd worker thread touching all kinds of objects, it means that virtually everything is volatile in the program.
I'm wondering if there is a way to turn off optimizations in the Apple LLVM compiler that create the bugs the volatile keyword was specifically designed to fix. These bugs don't show up in debug mode (because optimizations are off). Putting volatile everywhere basically means peppering every class with volatile after every member function, and adding volatile before every shared variable declaration.
I think I'd rather lose that volatile optimization than risk a spurious bug showing up because I forgot to mark something volatile.

In C++, volatile has nothing to do with thread safety. You cannot rely on it to avoid data races. Its purpose is to force synchronised accesses to a variable (from a single thread, or threads that use some other mechanism to synchronise with each other) to happen exactly in the order specified. This is often necessary when interacting with hardware, to prevent accesses that appear to do nothing, but actually affect the state of the hardware, from being optimised away. It gives no guarantees about the effect of unsynchronised accesses.
To avoid data races, you must use either atomic operations or explicit locks to synchronise access to shared objects. C++11 provides these in the standard library; if you're stuck in the past, then you'll have to rely on whatever libraries (such as pthreads) or language extensions (such as atomic intrinsics) are available on your platform.

Related

How are MT programs proven correct with "non sequential" semantics?

This could be a language neutral question, but in practice I'm interested with the C++ case: how are multithread programs written in C++ versions that support MT programming, that is modern C++ with a memory model, ever proven correct?
In old C++, MT programs were just written in term of pthread semantics and validated in term of the pthread rules which was conceptually easy: use primitives correctly and avoid data races.
Now, the C++ language semantic is defined in term of a memory model and not in term of sequential execution of primitive steps. (Also the standard mentions an "abstract machine" but I don't understand any more what it means.)
How are C++ programs proven correct with that non sequential semantic? How can anyone reason about a program that is not doing primitive steps one after the other?
It is "conceptually easier" with the C++ memory model than it was with pthreads prior to the C++ memory model. C++ prior to the memory model interacting with pthreads was loosely specified, and reasonable interpretations of the specification permitted the compiler to "introduce" data races, so it is extremely difficult (if possible at all) to reason about or prove correctness for MT algorithms in the context of older C++ with pthreads.
There seems to be a fundamental misunderstanding in the question in that C++ was never defined as a sequential execution of primitive steps. It has always been the case that there is a partial ordering between expression evaluations. And the compiler is allowed to move such expressions around unless constrained from doing so. This was unchanged by the introduction of the memory model. The memory model introduced a partial order for evaluations between separate threads of execution.
The advice "use the primitives correctly and avoid data races" still applies, but the C++ memory model more strictly and precisely constrains the interaction between the primitives and the rest of the language, allowing more precise reasoning.
In practice, it is not easy to prove correctness in either context. Most programs are not proven to be data race free. One tries to encapsulate as much as possible any synchronization so as to allow reasoning about smaller components, some of which can be proven correct. And one uses tools such as address sanitizer and thread sanitizer to catch data races.
On data races, POSIX says:
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads.... Applications may allow more than one thread of control to read a memory location simultaneously.
On data races, C++ says:
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
C++ defines more terms and tries to be more precise. The gist of this is that both forbid data races, which in both, are defined as conflicting accesses, without the use of the synchronization primitives.
POSIX says that the pthread functions synchronize memory with respect to other threads. That is underspecified. One could reasonably interpret that as (1) the compiler cannot move memory accesses across such a function call, and (2) after calling such a function in one thread, the prior actions to memory from that thread will be visible to another thread after it calls such a function. This was a common interpretation, and this is easily accomplished by treating the functions as opaque and potentially clobbering all of memory.
As an example of problems with this loose specification, the compiler is still allowed to introduce or remove memory accesses (e.g. through register promotion and spillage) and to make larger accesses than necessary (e.g. touching adjacent fields in a struct). Therefore, the compiler completely correctly could "introduce" data races that weren't written in the source code directly. The C++11 memory model stops it from doing so.
C++ says, with regard to mutex lock:
Synchronization: Prior unlock() operations on the same object shall synchronize with this operation.
So C++ is a little more specific. You have to lock and unlock the same mutex to have synchronization. But given this, C++ says that operations before the unlock are visible to the new locker:
An evaluation A strongly happens before an evaluation D if ... there are evaluations B and C such that A is sequenced before B, B simply happens before C, and C is sequenced before D. [ Note: Informally, if A strongly happens before B, then A appears to be evaluated before B in all contexts. Strongly happens before excludes consume operations. — end note ]
(With B = unlock, C = lock, B simply happens before C because B synchronizes with C. Sequenced before is a concept in a single thread of execution, so for example, one full expression is sequenced before the next.)
So, if you restrict yourself to the sorts of primitives (locks, condition variables, ...) that exist in pthread, and to the type of guarantees provided by pthread (sequential consistency), C++ should add no surprises. In fact, it removes some surprises, adds precision, and is more amenable to correctness proofs.
The article Foundations of the C++ Concurrency Memory Model is a great, expository read for anyone interested in this topic about the problems with the status quo at the time and the choices made to fix them in the C++11 memory model.
Edited to more clearly state that the premise of the question is flawed, that reasoning is easier with the memory model, and add a reference to the Boehm paper, which also shaped some of the exposition.

How compiler like GCC implement acquire/release semantics for std::mutex

My understanding is that std::mutex lock and unlock have a acquire/release semantics which will prevent instructions between them from being moved outside.
So acquire/release should disable both compiler and CPU reorder instructions.
My question is that I take a look at GCC5.1 code base and don't see anything special in std::mutex::lock/unlock to prevent compiler reordering codes.
I find a potential answer in does-pthread-mutex-lock-have-happens-before-semantics which indicates a mail that says a external function call act as compiler memory fences.
Is it always true? And where is the standard?
Threads are a fairly complicated, low-level feature. Historically, there was no standard C thread functionality, and instead it was done differently on different OS's. Today there is mainly the POSIX threads standard, which has been implemented in Linux and BSD, and now by extension OS X, and there are Windows threads, starting with Win32 and on. Potentially, there could be other systems besides these.
GCC doesn't directly contain a POSIX threads implementation, instead it may be a client of libpthread on a linux system. When you build GCC from source, you have to configure and build separately a number of ancillary libraries, supporting things like big numbers and threads. That is the point at which you select how threading will be done. If you do it the standard way on linux, you will have an implementation of std::thread in terms of pthreads.
On windows, starting with MSVC C++11 compliance, the MSVC devs implemented std::thread in terms of the native windows threads interface.
It's the OS's job to ensure that the concurrency locks provided by their API actually works -- std::thread is meant to be a cross-platform interface to such a primitive.
The situation may be more complicated for more exotic platforms / cross-compiling etc. For instance, in MinGW project (gcc for windows) -- historically, you have the option to build MinGW gcc using either a port of pthreads to windows, or using a native win32 based threading model. If you don't configure this when you build, you may end up with a C++11 compiler which doesn't support std::thread or std::mutex. See this question for more details. MinGW error: ‘thread’ is not a member of ‘std’
Now, to answer your question more directly. When a mutex is engaged, at the lowest level, this involves some call into libpthreads or some win32 API.
pthread_lock_mutex();
do_some_stuff();
pthread_unlock_mutex();
(The pthread_lock_mutex and pthread_unlock_mutex correspond to the implementations of lock and unlock of std::mutex on your platform, and in idiomatic C++11 code, these are in turn called in the ctor and dtor of std::unique_lock for instance if you are using that.)
Generally, the optimizer cannot reorder these unless it is sure that pthread_lock_mutex() has no side-effects that can change the observable behavior of do_some_stuff().
To my knowledge, the mechanism the compiler has for doing this is ultimately the same as what it uses for estimating the potential side-effects of calls to any other external library.
If there is some resource
int resource;
which is in contention among various threads, it means that there is some function body
void compete_for_resource();
and a function pointer to this is at some earlier point passed to pthread_create... in your program in order to initiate another thread. (This would presumably be in the implementation of the ctor of std::thread.) At this point, the compiler can see that any call into libpthread can potentially call compete_for_resource and touch any memory that that function touches. (From the compiler's point of view libpthread is a black box -- it is some .dll / .so and it can't make assumptions about what exactly it does.)
In particular, the call pthread_lock_mutex(); potentially has side-effects for resource, so it cannot be re-ordered against do_some_stuff().
If you never actually spawn any other threads, then to my knowledge, do_some_stuff(); could be reordered outside of the mutex lock. Since, then libpthread doesn't have any access to resource, it's just a private variable in your source and isn't shared with the external library even indirectly, and the compiler can see that.
All of these questions stem from the rules for compiler reordering. One of the fundamental rules for reordering is that the compiler must prove that the reorder does not change the result of the program. In the case of std::mutex, the exact meaning of that phrase is specified in a block of about 10 pages of legaleese, but the general intuitive sense of "doesn't change the result of the program" holds. If you had a guarantee about which operation came first, according to the specification, no compiler is allowed to reorder in a way which violates that guarantee.
This is why people often claim that a "function call acts as a memory barrier." If the compiler cannot deep-inspect the function, it cannot prove that the function didn't have a hidden barrier or atomic operation inside of it, thus it must treat that function as though it was a barrier.
There is, of course, the case where the compiler can inspect the function, such as the case of inline functions or link time optimizations. In these cases, one cannot rely on a function call to act as a barrier, because the compiler may indeed have enough information to prove the rewrite behaves the same as the original.
In the case of mutexes, even such advanced optimization cannot take place. The only way to reorder around the mutex lock/unlock function calls is to have deep-inspected the functions and proven there are no barriers or atomic operations to deal with. If it can't inspect every sub-call and sub-sub-call of that lock/unlock function, it can't prove it is safe to reorder. If it indeed can do this inspection, it would see that every mutex implementation contains something which cannot be reordered around (indeed, this is part of the definition of a valid mutex implementation). Thus, even in that extreme case, the compiler is still forbidden from optimizing.
EDIT: For completeness, I would like to point out that these rules were introduced in C++11. C++98 and C++03 reordering rules only prohibited changes that affected the result of the current thread. Such a guarantee is not strong enough to develop multithreading primitives like mutexes.
To deal with this, multithreading APIs like pthreads developed their own rules. from the Pthreads specification section 4.11:
Applications shall ensure that access to any memory location by more
than one thread of control (threads or processes) is restricted such
that no thread of control can read or modify a memory location while
another thread of control may be modifying it. Such access is
restricted using functions that synchronize thread execution and also
synchronize memory with respect to other threads. The following
functions synchronize memory with respect to other threads
It then lists a few dozen functions which synchronize memory, including pthread_mutex_lock and pthread_mutex_unlock.
A compiler which wishes to support the pthreads library must implement something to support this cross-thread memory synchronization, even though the C++ specification didn't say anything about it. Fortunately, any compiler where you want to do multithreading was developed with the recognition that such guarantees are fundamental to all multithreading, so every compiler that supports multithreading has it!
In the case of gcc, it did so without any special notes on the pthreads function calls because gcc would effectively create a barrier around every external function call (because it couldn't prove that no synchronization existed inside that function call). If gcc were to ever change that, they would also have to change their pthreads headers to include any extra verbage needed to mark the pthreads functions as synchronizing memory.
All of that, of course, is compiler specific. There were no standards answers to this question until C++11 came along with its new memory model.
NOTE: I am no expert in this area and my knowledge about it is in a spaghetti like condition. So take the answer with a grain of salt.
NOTE-2: This might not be the answer that OP is expecting. But here are my 2 cents anyways if it helps:
My question is that I take a look at GCC5.1 code base and don't see
anything special in std::mutex::lock/unlock to prevent compiler
reordering codes.
g++ using pthread library. std::mutex is just a thin wrapper around pthread_mutex. So, you will have to actually go and have a look at pthread's mutex implementation.
If you go bit deeper into the pthread implementation (which you can find here), you will see that it uses atomic instructions along with futex calls.
Two minor things to remember here:
1. The atomic instructions do use barriers.
2. Any function call is equivalent to full barrier. Do not remember from where I read it.
3. mutex calls may put the thread to sleep and cause context switch.
Now, as far as reordering goes, one of the things that needs to be guaranteed is that, no instruction after lock and before unlock should be reordered to before lock or after unlock. This I believe is not a full-barrier, but rather just acquire and release barrier respectively. But, this is again platform dependent, x86 provides sequential consistency by default whereas ARM provides a weaker ordering guarantee.
I strongly recommend this blog series:
http://preshing.com/archives/
It explains lots of lower level stuff in easy to understand language. Guess, I have to read it once again :)
UPDATE:: Unable to comment on #Cort Ammons answer due to length
#Kane I am not sure about this, but people in general write barriers for processor level which takes care of compiler level barriers as well. The same is not true for compiler builtin barriers.
Now, since the pthread_*lock* functions definitions are not present in the translation unit where you are making use of it (this is doubtful), calling lock - unlock should provide you with full memory barrier. The pthread implementation for the platform makes use of atomic instructions to block any other thread from accessing the memory locations after the lock or before unlock. Now since only one thread is executing the critical portion of the code it is ensured that any reordering within that will not change the expected behaviour as mentioned in above comment.
Atomics is pretty tough to understand and to get right, so, what I have written above is from my understanding. Would be very glad to know if my understanding is wrong here.
So acquire/release should disable both compiler and CPU reorder instructions.
By definition anything that prevents CPU reordering by speculative execution prevents compiler reordering. That's the definition of language semantics, even without MT (multi-threading) in the language, so you will be safe from reordering on old compilers that don't support MT.
But these compilers aren't safe for MT for a bunch of reasons, from the lack of thread protection around runtime initialization of static variables to the implicitly modified global variables like errno, etc.
Also, in C/C++, any call to a function that is purely external (that is: not inline, available for inlining at any point), without annotation explaining what it does (like the "pure function" attribute of some popular compiler), must be assumed to do anything that legal C/C++ code can do. No non trivial reordering would be possible (any reordering that is visible is non trivial).
Any correct implementation of locks on systems with multiple units of execution that don't simulate a global order on assembly instructions will require memory barriers and will prevent reordering.
An implementation of locks on a linearly executing CPU, with only one unit of execution (or where all threads are bound on the same unit of execution), might use only volatile variables for synchronisation and that is unsafe as volatile reads resp. writes do not provide any guarantee of acquire resp. release of any other data (contrast Java). Some kind of compiler barrier would be needed, like a strongly external function call, or some asm (""/*nothing*/) (which is compiler specific and even compiler version specific).

Way to do cross-platform interlocked options that also allows me to optionally bypass the synchronization?

With std::atomic, there seems to be no standards-compliant way to sometimes read/write without atomicity. Boost has interlocked operations, but they are in the details namespace so I don't think I'm supposed to use it. But I don't know all of boost. Is there something in boost or stl that I could use? Or perhaps a proposal which would address this, like maybe adding a std::memory_order_no_synchronization? Or access an abstraction of the interlocked machinery?
It seems like the need for this would appear in many designs. Objects that require thread-safety might be held temporarily in a single-threaded context, making atomic access superfluous. For example, when first constructing an object, it would usually be visible only on the thread creating it, until placed somewhere accessible to multiple threads. So long as your object is only reachable by a single thread, you wouldn't need std::atomic's safety at all. But then once ready, you'd publish it to other threads, and suddenly you need locking and enforced atomic access.
In this particular application, I'm building large lockless trees. During construction, there is just zero need for interlocked, so the existing design (which calls os-provided interlocked functions) does not use interlocked until it becomes necessary. After it becomes visible to other threads, all threads should be using an interlocked view. I can't port to std::atomic without introducing a bunch of pointless synchronization.
The best I can even think of now is to use std::memory_order_relaxed, but on ARM, this is still not the same as non-atomic access. On x86/amd64 it is though. Another hack is placement new on std::atomic, which writes a new value without atomicity, but that doesn't provide any way to read back the value non-atomically.

C++11 : Atomic variable : lock_free property : What does it mean?

I wanted to understand what does one mean by lock_free property of atomic variables in c++11. I did googled out and saw the other relevant questions on this forum but still having partial understanding. Appreciate if someone can explain it end-to-end and in simple way.
It's probably easiest to start by talking about what would happen if it was not lock-free.
The most obvious way to handle most atomic tasks is by locking. For example, to ensure that only one thread writes to a variable at a time, you can protect it with a mutex. Any code that's going to write to the variable needs obtain the mutex before doing the write (and release it afterwards). Only one thread can own the mutex at a time, so as long as all the threads follow the protocol, no more than one can write at any given time.
If you're not careful, however, this can be open to deadlock. For example, let's assume you need to write to two different variables (each protected by a mutex) as an atomic operation -- i.e., you need to ensure that when you write to one, you also write to the other). In such a case, if you aren't careful, you can cause a deadlock. For example, let's call the two mutexes A and B. Thread 1 obtains mutex A, then tries to get mutex B. At the same time, thread 2 obtains mutex B, and then tries to get mutex A. Since each is holding one mutex, neither can obtain both, and neither can progress toward its goal.
There are various strategies to avoid them (e.g., all threads always try to obtain the mutexes in the same order, or upon failure to obtain a mutex within a reasonable period of time, each thread releases the mutex it holds, waits some random amount of time, and then tries again).
With lock-free programming, however, we (obviously enough) don't use locks. This means that a deadlock like above simply cannot happen. When done properly, you can guarantee that all threads continuously progress toward their goal. Contrary to popular belief, it does not mean the code will necessarily run any faster than well written code using locks. It does, however, mean that deadlocks like the above (and some other types of problems like livelocks and some types of race conditions) are eliminated.
Now, as to exactly how you do that: the answer is short and simple: it varies -- widely. In a lot of cases, you're looking at specific hardware support for doing certain specific operations atomically. Your code either uses those directly, or extends them to give higher level operations that are still atomic and lock free. It's even possible (though only rarely practical) to implement lock-free atomic operations without hardware support (but given its impracticality, I'll pass on trying to go into more detail about it, at least for now).
Jerry already mentioned common correctness problems with locks, i.e. they're hard to understand and program correctly.
Another danger with locks is that you lose determinism regarding your execution time: if a thread that has acquired a lock gets delayed (e.g. descheduled by the operating system, or "swapped out"), then it is pos­si­ble that the entire program is de­layed be­cause it is waiting for the lock. By contrast, a lock-free al­go­rithm is al­ways gua­ran­teed to make some progress, even if any number of threads are held up some­where else.
By and large, lock-free programming is often slower (sometimes significantly so) than locked pro­gram­ming using non-atomic operations, because atomic operations cause a sig­ni­fi­cant hit on caching and pipelining; however, it offers determinism and upper bounds on latency (at least overall latency of your process; as #J99 observed, individual threads may still be starved as long as enough other threads are making progress). Your program may get a lot slower, but it never locks up entirely and always makes some progress.
The very nature of hardware architectures allows for certain small operations to be intrinsically atomic. In fact, this is a very necessity of any hardware that supports multitasking and multithreading. At the very heart of any synchronisation primitive, such as a mutex, you need some sort of atomic instruction that guarantees correct locking behaviour.
So, with that in mind, we now know that it is possible for certain types like booleans and machine-sized integers to be loaded, stored and exchanged atomically. Thus when we wrap such a type into an std::atomic template, we can expect that the resulting data type will indeed offer load, store and exchange operations that do not use locks. By contrast, your library implementation is always allowed to implement an atomic Foo as an ordinary Foo guarded by a lock.
To test whether an atomic object is lock-free, you can use the is_lock_free member function. Additionally, there are ATOMIC_*_LOCK_FREE macros that tell you whether atomic primitive types potentially have a lock-free instantiation. If you are writing concurrent algorithms that you want to be lock-free, you should thus include an assertion that your atomic objects are indeed lock-free, or a static assertion on the macro to have value 2 (meaning that every object of the corresponding type is always lock-free).
Lock-free is one of the non-blocking techniques. For an algorithm, it involves a global progress property: whenever a thread of the program is active, it can make a forward step in its action, for itself or eventually for the other.
Lock-free algorithms are supposed to have a better behavior under heavy contentions where threads acting on a shared resources may spent a lot of time waiting for their next active time slice. They are also almost mandatory in context where you can't lock, like interrupt handlers.
Implementations of lock-free algorithms almost always rely on Compare-and-Swap (some may used things like ll/sc) and strategy where visible modification can be simplified to one value (mostly pointer) change, making it a linearization point, and looping over this modification if the value has change. Most of the time, these algorithms try to complete jobs of other threads when possible. A good example is the lock-free queue of Micheal&Scott (http://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html).
For lower-level instructions like Compare-and-Swap, it means that the implementation (probably the micro-code of the corresponding instruction) is wait-free (see http://www.diku.dk/OLD/undervisning/2005f/dat-os/skrifter/lockfree.pdf)
For the sake of completeness, wait-free algorithm enforces progression for each threads: each operations are guaranteed to terminate in a finite number of steps.

Do I need Read-Write lock here

I have writing a multi threaded code. I am not sure, whether I would need a read and write lock mechanism. Could you please go through the usecase and tell me do I have to use read-write lock or just normal mutex will do.
Use case:
1) Class having two variables. These are accessed by every thread before doing operation.
2) When something goes wrong, these variables are updated to reflect the error scenarios.
Thus threads reading these variables can take different decisions (including abort)
Here, in second point, I need to update the data. And in first point, every thread will use the data. So, my question is do I have to use write lock while updating data and read lock while reading the data. (Note: Both variables are in memory. Just a boolean flag & string)
I am confused because as my both vars are in memory. So does OS take care when integrity. I mean I can live with 1 or 2 threads missing the updated value when some thread is writing the data in mutex.
Please tell if I am right or wrong? Also please tell If I have to use read-write lock or just normal mutex would do.
Update: I am sorry that I did not give platform and compiler name. I am on RHEL 5.0 and using gcc 4.6. My platform is x86_64. But I don not want my code to be OS specific because we are going to port the code shortly to Solaris 10.
First off, ignore those other answerers talking about volatile. Volatile is almost useless for multithreaded programming, and any false sense of safety given by it is just that - false.
Now, whether you need a lock depends on what you're doing with these variables. You will need a memory barrier at least (locks imply one).
So let's give an example:
One flag is an error flag. If zero, you continue, otherwise, you abort.
Another flag is a diagnostic code flag. It gives the precise reason for the error.
In this case, one option would be to do the following:
Read the error flag without a lock, but with read memory barriers after the read.
When an error occurs, take a lock, set the diagnostic code and error flags, then release the lock. If the diagnostic code is already set, release the lock immediately.
The memory barriers are needed, as otherwise the compiler (or CPU!) may choose to cache the same result for every read.
Of course, if the semantics of your two variables are different, the answer may vary. You'll need to be more specific.
Note that the exact mechanism for specifying locks and memory barriers depends on the compiler. C++0x provides a portable mechanism, but few compilers fully implement the C++0x standard yet. Please specify your compiler and OS for a more detailed answer.
As for your output data, you will almost certainly need a lock there. Try to avoid taking these locks too often though, as too much lock contention will kill your performance.
If they are atomic variables (C1x stdatomic.h or C++0x atomic), then you don't need read/write locks. With earlier C/C++ standards, there's no portable way of using multiple threads at all, so you need to look into how the implementation you are using does things. In most implementations, data types that can be accessed with a single machine instruction are atomic.
Note that just having an atomic variable is not enough -- you probably also need to declare it as volatile to guarantee that the compiler does not do things that will cause you to miss updates from other threads.
Thus threads reading these variables can take different decisions (including abort)
So each thread need to ensure that it reads the updated data. Also since the variables are shared, you need to take care about the race condition as well.
So in short - you need to use read/write locks when reading and writing to these shared variables.
See if you can use volatile variables - that should save you from using locks when you read the values (however write should still be with locks). This is applicable only because you said that -
I mean I can live with 1 or 2 threads missing the updated value when some thread is writing the data in mutex