Suppose that one has some lock-based code like the following, where a mutex is used to guard against inappropriate concurrent reads and writes:
mutex.get() ; // get a lock.
T localVar = pSharedMem->v ; // read something
pSharedMem->w = blah ; // write something.
pSharedMem->z++ ; // read and write something.
mutex.release() ; // release the lock.
Even if one assumes that the generated code is created in program order, there is still a requirement for appropriate hardware memory barriers like isync, lwsync, .acq, .rel. I'll assume for this question that the mutex implementation takes care of this part, providing a guarantee that the pSharedMem reads and writes all occur "after" the get() and "before" the release() [but that surrounding reads and writes can move into the critical section, as I expect is the norm for mutex implementations]. I'll also assume that volatile accesses are used in the mutex implementation where appropriate, but that volatile is NOT used for the data protected by the mutex (understanding why volatile does not appear to be a requirement for the mutex-protected data is really part of this question).
I'd like to understand what prevents the compiler from moving the pSharedMem accesses outside of the critical region. In the C and C++ standards I see that there is a concept of sequence points. Much of the sequence-point text in the standards documents I found incomprehensible, but if I were to guess what it is about, it is a statement that code should not be reordered across a point where there is a call with unknown side effects. Is that the gist of it? If so, what sort of optimization freedom does the compiler have here?
With compilers doing tricky optimizations like profile-driven interprocedural inlining (even across file boundaries), even the concept of an unknown side effect gets kind of blurry.
It is perhaps beyond the scope of a simple question to explain this in a self-contained way here, so I am open to being pointed at references (preferably online, and targeted at mortal programmers, not compiler writers and language designers).
EDIT: (in response to Jalf's reply)
I'd mentioned the memory barrier instructions like lwsync and isync because of the CPU reordering issues you also mentioned. I happen to work in the same lab as the compiler guys (for one of our platforms at least), and having talked to the implementers of the intrinsics I happen to know that at least for the xlC compiler __isync() and __lwsync() (and the rest of the atomic intrinsics) are also a code reordering barrier. In our spinlock implementation this is visible to the compiler since this part of our critical section is inlined.
However, suppose you weren't using a custom-built lock implementation (as we happen to be, which is likely uncommon), and just called a generic interface such as pthread_mutex_lock(). There the compiler isn't informed of anything more than the prototype. I've never seen it suggested that code like
pthread_mutex_lock( &m ) ;
pSharedMem->someNonVolatileVar++ ;
pthread_mutex_unlock( &m ) ;
pthread_mutex_lock( &m ) ;
pSharedMem->someNonVolatileVar++ ;
pthread_mutex_unlock( &m ) ;
would be non-functional unless the variable were changed to volatile. That increment is going to have a load/increment/store sequence in each of the back-to-back blocks of code, and would not operate correctly if the value from the first increment were retained in a register for the second.
It seems likely that the unknown side effects of pthread_mutex_lock() are what protect this back-to-back increment example from behaving incorrectly.
I'm talking myself into the conclusion that the semantics of a code sequence like this in a threaded environment are not really strictly covered by the C or C++ language specs.
In short, the compiler is allowed to reorder or transform the program as it likes, as long as the observable behavior on a C++ virtual machine does not change. The C++ standard has no concept of threads, and so this fictive VM only runs a single thread. And on such an imaginary machine, we don't have to worry about what other threads see. As long as the changes don't alter the outcome of the current thread, all code transformations are valid, including reordering memory accesses across sequence points.
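To make that concrete, here is a sketch (with illustrative names, not taken from the question) of a transformation the as-if rule permits on single-threaded code:

int shared_counter = 0; // ordinary non-volatile data

void hot_loop()
{
    // As written: one load and one store per iteration.
    for (int i = 0; i < 1000; ++i)
        shared_counter++;
}

// A conforming compiler may legally rewrite the loop as:
//     int tmp = shared_counter;      // single load
//     for (int i = 0; i < 1000; ++i)
//         tmp++;                     // kept in a register
//     shared_counter = tmp;          // single store
// No single-threaded observer can tell the difference, but a second
// thread polling shared_counter would never see the intermediate values.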
understanding why volatile does not appear to be a requirement for the mutex-protected data is really part of this question
Volatile ensures one thing, and one thing only: reads from a volatile variable will be read from memory every time -- the compiler won't assume that the value can be cached in a register. And likewise, writes will be written through to memory. The compiler won't keep it around in a register "for a while, before writing it out to memory".
But that's all. When the write occurs, a write will be performed, and when the read occurs, a read will be performed. But it doesn't guarantee anything about when this read/write will take place. The compiler may, as it usually does, reorder operations as it sees fit (as long as it doesn't change the observable behavior in the current thread, the one that the imaginary C++ CPU knows about). So volatile doesn't really solve the problem. On the other hand, it offers a guarantee that we don't really need. We don't need every write to the variable to be written out immediately; we just want to ensure that they get written out before crossing this boundary. It's fine if they're cached until then, and likewise, once we've crossed the critical section boundary, subsequent writes can be cached again for all we care, until we cross the boundary the next time. So volatile offers a guarantee that is too strong, which we don't need, but doesn't offer the one we do need (that reads/writes won't get reordered).
So to implement critical sections, we need to rely on compiler magic. We have to tell it that "ok, forget about the C++ standard for a moment, I don't care what optimizations it would have allowed if you'd followed that strictly. You must NOT reorder any memory accesses across this boundary".
Critical sections are typically implemented via special compiler intrinsics (essentially special functions that are understood by the compiler), which 1) force the compiler to avoid reordering across the intrinsic, and 2) make it emit the necessary instructions to get the CPU to respect the same boundary (because the CPU reorders instructions too, and without issuing a memory barrier instruction, we'd risk the CPU doing the same reordering that we just prevented the compiler from doing).
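To make that concrete, here is a minimal test-and-set spinlock sketched with GCC's legacy __sync builtins (the names spin_lock, spin_unlock, and lock_word are illustrative, and this is not a production lock). The builtins demonstrate both halves: they are opaque reordering barriers for the compiler, and they carry the required hardware barrier semantics (acquire on lock, release on unlock):

static volatile int lock_word = 0; // 0 = unlocked, 1 = locked

void spin_lock()
{
    // Atomic exchange with acquire semantics; also a compiler barrier.
    // Accesses after the lock cannot be hoisted above it.
    while (__sync_lock_test_and_set(&lock_word, 1))
        ; // spin until we transition 0 -> 1
}

void spin_unlock()
{
    // Atomic store of 0 with release semantics: accesses before the
    // unlock cannot be moved after it, so the critical section stays inside.
    __sync_lock_release(&lock_word);
}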
No, sequence points do not prevent rearranging of operations. The main, most broad rule that governs optimizations is the requirement imposed on so called observable behavior. Observable behavior, by definition, is read/write accesses to volatile variables and calls to library I/O functions. These events must occur in the same order and produce the same results as they would in the "canonically" executed program. Everything else can be rearranged and optimized absolutely freely by the compiler, in any way it sees fit, completely ignoring any orderings imposed by sequence points.
Of course, most compilers try not to do any excessively wild rearrangements. However, the issue you are mentioning has become a real practical issue with modern compilers in recent years. Many implementations offer additional implementation-specific mechanisms that allow the user to ask the compiler not to cross certain boundaries when doing optimizing rearrangements.
Since, as you are saying, the protected data is not declared as volatile, formally speaking the access can be moved outside of the protected region. If you declare the data as volatile, it should prevent this from happening (assuming that mutex access is also volatile).
Let's look at the following example:
my_pthread_mutex_lock( &m ) ;
someNonVolatileGlobalVar++ ;
my_pthread_mutex_unlock( &m ) ;
The function my_pthread_mutex_lock() is just calling pthread_mutex_lock(). By using my_pthread_mutex_lock(), I'm sure the compiler doesn't know that it's a synchronization function. For the compiler, it's just a function, and for me, it's a synchronization function that I can easily reimplement.
Because someNonVolatileGlobalVar is global, I expect the compiler won't move someNonVolatileGlobalVar++ outside the critical section. In fact, because of observable behavior, even in a single-threaded situation the compiler doesn't know whether the function called before and the one called after this instruction modify the global variable, so to keep the observable behavior correct it has to keep the execution order as written.
I hope pthread_mutex_lock() and pthread_mutex_unlock() also perform hardware memory barriers, in order to prevent the hardware from moving these instructions outside the critical section.
Am I right?
If I write:
my_pthread_mutex_lock( &m ) ;
someNonVolatileGlobalVar1++ ;
someNonVolatileGlobalVar2++ ;
my_pthread_mutex_unlock( &m ) ;
I cannot know which one of the two variables is incremented first, but this is normally not an issue.
Now, if I write:
someGlobalPointer = &someNonVolatileLocalVar;
my_pthread_mutex_lock( &m ) ;
someNonVolatileLocalVar++ ;
my_pthread_mutex_unlock( &m ) ;
or
someLocalPointer = &someNonVolatileGlobalVar;
my_pthread_mutex_lock( &m ) ;
(*someLocalPointer)++ ;
my_pthread_mutex_unlock( &m ) ;
Is the compiler doing what an unsuspecting developer expects?
C/C++ sequence points occur, for example, when ';' is encountered, at which point all side effects of all preceding operations must have occurred. However, I'm fairly certain that by "side effect" what's meant is operations that are part of the language itself (like z being incremented in 'z++') and not effects at lower/higher levels (like what the OS actually does with regard to memory management, thread management, etc. after an operation is completed).
Does that kinda answer your question? My point is really just that, AFAIK, the concept of sequence points doesn't have anything to do with the side effects you're referring to.
hth
See [linux-kernel]/Documentation/memory-barriers.txt.
Related
A coworker and I write software for a variety of platforms, running on x86, x64, Itanium, PowerPC, and other 10-year-old server CPUs.
We just had a discussion about whether mutex functions such as pthread_mutex_lock() ... pthread_mutex_unlock() are sufficient by themselves, or whether the protected variable needs to be volatile.
int foo::bar()
{
    // ...
    // code which may or may not access _protected.
    pthread_mutex_lock(m);
    int ret = _protected;
    pthread_mutex_unlock(m);
    return ret;
}
My concern is caching. Could the compiler place a copy of _protected on the stack or in a register, and use that stale value in the assignment? If not, what prevents that from happening? Are variations of this pattern vulnerable?
I presume that the compiler doesn't actually understand that pthread_mutex_lock() is a special function, so are we just protected by sequence points?
Thanks greatly.
Update: Alright, I can see a trend with answers explaining why volatile is bad. I respect those answers, but articles on that subject are easy to find online. What I can't find online, and the reason I'm asking this question, is how I'm protected without volatile. If the above code is correct, how is it invulnerable to caching issues?
The simplest answer is that volatile is not needed for multi-threading at all.
The long answer is that sequence points like critical sections are platform-dependent, as is whatever threading solution you're using, so most of your thread safety is also platform-dependent.
C++0x has a concept of threads and thread safety, but the current standard does not, and therefore volatile is sometimes misidentified as something that prevents reordering of operations and memory accesses in multi-threaded programming, when it was never intended for that and can't reliably be used that way.
The only thing volatile should be used for in C++ is to allow access to memory mapped devices, allow uses of variables between setjmp and longjmp, and to allow uses of sig_atomic_t variables in signal handlers. The keyword itself does not make a variable atomic.
The good news is that in C++0x we will have the STL construct std::atomic, which can be used to guarantee atomic operations and thread-safe constructs for variables. Until your compiler of choice supports it, you may need to turn to the Boost library or bust out some assembly code to create your own objects providing atomic variables.
P.S. A lot of the confusion is caused by Java and .NET actually enforcing multi-threaded semantics with the keyword volatile; C++, however, follows C, where this is not the case.
Your threading library should include the appropriate CPU and compiler barriers on mutex lock and unlock. For GCC, a memory clobber on an asm statement acts as a compiler barrier.
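For instance (GCC/Clang extended asm; COMPILER_BARRIER is an illustrative name), the clobber compiles to zero instructions but forbids the compiler from caching memory values across it:

#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

void example()
{
    // Stores before the barrier must be emitted before it, and any value
    // read after it must be reloaded from memory. Note: this constrains
    // only the compiler; no CPU fence instruction is generated.
    COMPILER_BARRIER();
}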
Actually, there are two things that protect your code from (compiler) caching:
You are calling a non-pure external function (pthread_mutex_*()), which means that the compiler doesn't know that the function doesn't modify your global variables, so it has to reload them.
As I said, pthread_mutex_*() includes a compiler barrier; e.g., on glibc/x86, pthread_mutex_lock() ends up calling the macro lll_lock(), which has a memory clobber, forcing the compiler to reload variables.
If the above code is correct, how is it invulnerable to caching issues?
Until C++0x, it is not. And it is not specified in C. So, it really depends on the compiler. In general, if the compiler does not guarantee that it will respect ordering constraints on memory accesses for functions or operations that involve multiple threads, you will not be able to write multithreaded safe code with that compiler. See Hans J Boehm's Threads Cannot be Implemented as a Library.
As for what abstractions your compiler should support for thread safe code, the wikipedia entry on Memory Barriers is a pretty good starting point.
(As for why people suggested volatile, some compilers treat volatile as a memory barrier for the compiler. It's definitely not standard.)
The volatile keyword is a hint to the compiler that the variable might change outside of program logic, such as a memory-mapped hardware register that could change as part of an interrupt service routine. This prevents the compiler from assuming a cached value is always correct and would normally force a memory read to retrieve the value. This usage pre-dates threading by a couple decades or so. I've seen it used with variables manipulated by signals as well, but I'm not sure that usage was correct.
Variables guarded by mutexes are guaranteed to be correct when read or written by different threads. The threading API is required to ensure that such views of variables are consistent. This access is all part of your program logic and the volatile keyword is irrelevant here.
With the exception of the simplest spin-lock algorithm, mutex code is quite involved: good optimized mutex lock/unlock code contains the kind of code even excellent programmers struggle to understand. It uses special compare-and-set instructions, manages not only the unlocked/locked state but also the wait queue, and optionally uses system calls to go into a wait state (for lock) or to wake up other threads (for unlock).
There is no way the average compiler can decode and "understand" all that complex code (again, with the exception of a simple spin lock), so even for a compiler that is not aware of what a mutex is and how it relates to synchronization, there is in practice no way it could optimize anything around such code.
That holds even if the code is "inline", or available for analysis for the purpose of cross-module optimization, or if global optimization is available.
I presume that the compiler doesn't actually understand that pthread_mutex_lock() is a special function, so are we just protected by sequence points?
The compiler does not know what it does, so does not try to optimize around it.
How is it "special"? It's opaque and treated as such. It is not special among opaque functions.
There is no semantic difference with an arbitrary opaque function that can access any other object.
My concern is caching. Could the compiler place a copy of _protected on the stack or in a register, and use that stale value in the assignment?
Yes, in code that acts on objects transparently and directly, by using the variable name or pointers in a way that the compiler can follow; not in code that might use arbitrary pointers to access variables indirectly.
So yes, between calls to opaque functions. Not across them.
And also for variables which can only be used in the function, by name: local variables that don't have their address taken or a reference bound to them (such that the compiler cannot follow all further uses). These can indeed be "cached" across arbitrary calls, including lock/unlock.
If not, what prevents that from happening? Are variations of this pattern vulnerable?
Opacity of the functions. Non-inlining. Assembly code. System calls. Code complexity. Everything that makes compilers bail out and think "that's complicated stuff, just make calls to it".
The default position of a compiler is always "let's execute stupidly, I don't understand what is being done anyway", not "I will optimize that / let's rewrite the algorithm, I know better". Most code is not optimized in complex, non-local ways.
Now let's assume the absolute worst (from our point of view, which is that the compiler should give up; it's the absolute best from the point of view of an optimizing algorithm):
the function is "inline" (= available for inlining) (or global optimization kicks in, or all functions are morally "inline");
no memory barrier is needed (as on a mono-processor time-sharing system, or on a strongly ordered multi-processor system) in that synchronization primitive (lock or unlock), so it contains no such thing;
there is no special instruction (like compare-and-set) used (for example, in a spin lock the unlock operation is a simple write);
there is no system call to pause or wake threads (not needed in a spin lock);
then we might have a problem, as the compiler could optimize around the function call. This is fixed trivially by inserting a compiler barrier, such as an empty asm statement with a "clobber" for other accessible variables. The compiler then simply assumes that anything that might be accessible to a called function is "clobbered".
or whether the protected variable needs to be volatile.
You can make it volatile for the usual reason you make things volatile: to be certain to be able to access the variable in the debugger, to prevent a floating point variable from having the wrong datatype at runtime, etc.
Making it volatile would actually not even fix the issue described above, as volatile is essentially a memory operation in the abstract machine that has the semantics of an I/O operation, and as such is only ordered with respect to:
real I/O like iostream
system calls
other volatile operations
asm memory clobbers (but then no memory side effect is reordered around those)
calls to external functions (as they might do one of the above)
Volatile is not ordered with respect to non-volatile memory side effects. That makes volatile practically useless for writing thread-safe code, even in the most specific case where volatile would a priori help, the case where no memory fence is ever needed: when programming threading primitives on a time-sharing system with a single CPU. (That may be one of the least understood aspects of either C or C++.)
So while volatile does prevent "caching", it doesn't even prevent compiler reordering of lock/unlock operations unless all shared variables are volatile.
Locks/synchronization primitives make sure the data is not cached in registers or the CPU cache; that means the data propagates to memory. If two threads access or modify data within locks, it is guaranteed that the data is read from memory and written to memory. We don't need volatile in this use case.
But in cases where you have code with double checks, the compiler can optimize the code and remove redundant reads; to prevent that, we need volatile.
Example: see singleton pattern example
https://en.m.wikipedia.org/wiki/Singleton_pattern#Lazy_initialization
Why would someone write this kind of code?
Answer: there is a performance benefit to not acquiring the lock.
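For reference, here is how the double-checked pattern behind that link is usually sketched once C++11 atomics are available (the class and member names are illustrative, not from the linked article); the acquire/release pair stops both the compiler and the CPU from publishing the pointer before the object is constructed:

#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton* instance()
    {
        Singleton* p = s_instance.load(std::memory_order_acquire); // first (unlocked) check
        if (!p) {
            std::lock_guard<std::mutex> guard(s_mutex);
            p = s_instance.load(std::memory_order_relaxed);        // second (locked) check
            if (!p) {
                p = new Singleton;
                s_instance.store(p, std::memory_order_release);    // publish after construction
            }
        }
        return p;
    }
private:
    static std::atomic<Singleton*> s_instance;
    static std::mutex s_mutex;
};

std::atomic<Singleton*> Singleton::s_instance{nullptr};
std::mutex Singleton::s_mutex;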
P.S.: This is my first post on Stack Overflow.
Not if the object you're locking is volatile, e.g., if the value it represents depends on something foreign to the program (hardware state).
volatile should NOT be used to denote any kind of behavior that is the result of executing the program.
If it's actually volatile, what I personally would do is lock the value of the pointer/address instead of the underlying object.
e.g.:

volatile int i = 0;
// ... later, in a thread
// ... code that may not access anything without a lock

// &i is volatile int*, so it must be cast before being stored in a
// uintptr_t (requires <cstdint>):
std::uintptr_t ptr_to_lock =
    reinterpret_cast<std::uintptr_t>(const_cast<int*>(&i));
some_lock(ptr_to_lock);
// use i
release_some_lock(ptr_to_lock);
Please note that it only works if ALL the code ever using the object in a thread locks the same address. So be mindful of that when using threads with some variable that is part of an API.
I'm wondering what are the guarantees that compilers make to ensure that threaded writes to memory have visible effects in other threads.
I know countless cases in which this is problematic, and I'm sure that if you're interested in answering you know it too, but please focus on the cases I'll be presenting.
More precisely, I am concerned about the circumstances that can lead to threads missing memory updates done by other threads. I don't care (at this point) if the updates are non-atomic or badly synchronized: as long as the concerned threads notice the changes, I'll be happy.
I hope that compilers make the distinction between two kinds of variable accesses:
Accesses to variables that necessarily have an address;
Accesses to variables that don't necessarily have an address.
For instance, if you take this snippet:
void sleepingbeauty()
{
    int i = 1;
    while (i) sleep(1);
}
Since i is a local, I assume that my compiler can optimize it away, and just let the sleeping beauty fall to eternal slumber.
void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}
Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.
int i = 1;

void sleepingbeauty()
{
    while (i) sleep(1);
}
Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it instead of caching the value.
void sleepingbeauty(int* ptr)
{
    *ptr = 1;
    while (*ptr) sleep(1);
}
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.
I'm fairly sure that this is the memory access model used by every C and C++ compiler in production out there, but I don't think there are any guarantees. In fact, C++03 is even blind to the existence of threads, so this question wouldn't even make sense with the standard in mind. I'm not sure about C, though.
Is there some documentation out there that specifies whether I'm right or wrong? I know these are muddy waters since this may not be on standards grounds, but it seems like an important issue to me.
Besides the compiler generating reads, I'm also worried that the CPU cache could technically retain an outdated value, and that even though my compiler did its best to bring the reads and writes about, the values never synchronise between threads. Can this happen?
Accesses to variables that don't necessarily have an address.
All variables must have addresses (from the language's perspective; compilers are allowed to avoid giving things addresses if they can, but that's not visible from inside the language). It's a side effect of the requirement that everything be "pointerable" that everything has an address; even an empty class typically has a size of at least one char so that a pointer can be created to it.
Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.
That depends on the content of onedaymyprincewillcome. The compiler may inline that function if it wishes and still make no memory reads.
Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it.
Yes, but it really doesn't matter if there are reads to it. These reads might simply be going to cache on your current local CPU core, not actually going all the way back to main memory. You would need something like a memory barrier for this, and no C++ compiler is going to do that for you.
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.
Nope -- not required. The function may be inlined, which would allow the compiler to completely remove these things if it so desires.
The only language feature in the standard that lets you control things like this w.r.t. threading is volatile, which simply requires that the compiler generate reads. That does not mean the value will be consistent though because of the CPU cache issue -- you need memory barriers for that.
If you need true multithreading correctness, you're going to be using some platform specific library to generate memory barriers and things like that, or you're going to need a C++0x compiler which supports std::atomic, which does make these kinds of requirements on variables explicit.
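As a sketch of the platform-specific route, GCC has long offered a full-barrier builtin, __sync_synchronize() (a compiler extension, not standard C++); the function and variable names below are illustrative:

int payload;       // ordinary data
int payload_ready; // flag polled by another thread

void publish(int value)
{
    payload = value;
    __sync_synchronize(); // full barrier: neither the compiler nor the CPU
                          // may reorder memory accesses across this point
    payload_ready = 1;    // the flag cannot become visible before the data
}

The reading side needs a matching barrier between its load of the flag and its load of the data; that pairing is exactly what std::atomic's acquire/release operations later made portable.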
You assume wrong.
void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}
In this code, your compiler will load i from memory each time through the loop. Why? NOT because it thinks another thread could alter its value, but because it thinks that sleep could modify it. It has nothing to do with whether or not i has (or must have) an address, and everything to do with the operations that this thread performs that could modify the variable.
In particular, it is not guaranteed that assigning to an int is even atomic, although this happens to be true on all platforms we use these days.
Too many things go wrong if you don't use the proper synchronization primitives for your threaded programs. For example,
char *str = 0;
asynch_get_string(&str);
while (!str)
    sleep(1);
puts(str);
This could (and even will, on some platforms) sometimes print out utter garbage and crash the program. It looks safe, but because you are not using the proper synchronization primitives, the change to str could be seen by your thread before the change to the memory location it refers to, even though the other thread initializes the string before setting the pointer.
So just don't, don't, don't do this kind of stuff. And no, volatile is not a fix.
Summary: The basic problem is that the compiler only changes what order the instructions go in, and where the load and store operations go. This is not enough to guarantee thread safety in general, because the processor is free to change the order of loads and stores, and the order of loads and stores is not preserved between processors. In order to ensure things happen in the right order, you need memory barriers. You can either write the assembly yourself or you can use a mutex / semaphore / critical section / etc, which does the right thing for you.
While the C++98 and C++03 standards do not dictate a standard memory model that must be used by compilers, C++0x does, and you can read about it here: http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf
In the end, for C++98 and C++03, it's really up to the compiler and the hardware platform. Typically there will not be any memory barrier or fence-operation issued by the compiler for normally written code unless you use a compiler intrinsic or something from your OS's standard library for synchronization. Most mutex/semaphore implementations also include a built-in memory barrier operation to prevent speculative reads and writes across the locking and unlocking operations on the mutex by the CPU, as well as prevent any re-ordering of operations across the same read or write calls by the compiler.
Finally, as Billy points out in the comments, on Intel x86 and x86_64 platforms, any single-byte read or write is atomic, as is a read or write of a register value to any 4-byte-aligned memory location on x86, or any 4- or 8-byte-aligned location on x86_64. On other platforms that may not be the case, and you would have to consult the platform's documentation.
The only control you have over optimisation is volatile.
Compilers make NO guarantee about concurrent threads accessing the same location at the same time. You will need some type of locking mechanism.
I can only speak for C, and since synchronization is a CPU-implemented functionality, a C programmer would need to call a library function for the OS containing access to the lock (CriticalSection functions in the Windows NT engine) or implement something simpler (such as a spinlock) and access the functionality himself.
volatile is a good property to use at the module level. Sometimes a non-static (public) variable will work too.
Local (stack) variables will not be accessible from other threads, and should not be.
Variables at the module level are good candidates for access by multiple threads, but will require synchronization functions to work predictably.
Locks are unavoidable but they can be used more or less wisely resulting in a negligible or considerable performance penalty.
I answered a similar question here concerning unsynchronized threads but I think you'll be better off browsing on similar topics to get high-quality answers.
I'm not sure you understand the basics of the topic you claim to be discussing. Two threads, each starting at exactly the same time and looping one million times each performing an inc on the same variable will NOT result in a final value of two million (two * one million increments). The value will end up somewhere in-between one and two million.
The first increment will cause the value to be read from RAM into the L1 cache (via first the L3, then the L2) of the accessing thread/core. The increment is performed and the new value written, initially to L1, for propagation to the lower caches. When it reaches L3 (the highest cache common to both cores) the memory location will be invalidated in the other core's caches. This may seem safe, but in the meantime the other core has simultaneously performed an increment based on the same initial value in the variable. The invalidation from the write by the first core will be superseded by the write from the second core, invalidating the data in the caches of the first core.
Sounds like a mess? It is! The cores are so fast that what happens in the caches falls way behind: the cores are where the action is. This is why you need explicit locks: to make sure that the new value winds up low enough in the memory hierarchy such that other cores will read the new value and nothing else. Or put another way: slow things down so the caches can catch up with the cores.
A compiler does not "feel." A compiler is rule-based and, if constructed correctly, will optimize to the extent that the rules allow and the compiler writers are able to construct the optimizer. If a variable is volatile and the code is multi-threaded, the rules won't allow the compiler to skip a read. Simple as that, even though on the face of it it may appear devilishly tricky.
I'll have to repeat myself and say that locks cannot be implemented in a compiler because they are specific to the OS. The generated code will call all functions without knowing if they are empty, contain lock code or will trigger a nuclear explosion. In the same way the code will not be aware of a lock being in progress since the core will insert wait states until the lock request has resulted in the lock being in place. The lock is something that exists in the core and in the mind of the programmer. The code shouldn't (and doesn't!) care.
I'm writing this answer because most of the help came from comments to questions, and not always from the authors of the answers. I already upvoted the answers that helped me most, and I'm making this a community wiki to not abuse the knowledge of others. (If you want to upvote this answer, consider also upvoting Billy's and Dietrich's answers too: they were the most helpful authors to me.)
There are two problems to address when values written from a thread need to be visible from another thread:
Caching (a value written from a CPU could never make it to another CPU);
Optimizations (a compiler could optimize away the reads to a variable if it feels it can't be changed).
The first one is rather easy. On modern Intel processors, there is a concept of cache coherence, which means changes to a cache propagate to other CPU caches.
It turns out the optimization part isn't too hard either. As long as the compiler cannot guarantee that a function call won't change the content of a variable, even in a single-threaded model, it won't optimize the reads away. In my examples, the compiler doesn't know that sleep cannot change i, and this is why reads are issued at every iteration. It doesn't need to be sleep, though; any function for which the compiler doesn't have the implementation details would do. I suppose that a particularly well-suited function to use would be one that emits a memory barrier.
In the future, it's possible that compilers will have better knowledge of currently impenetrable functions. However, when that time comes, I expect that there will be standard ways to ensure that changes are propagated correctly. (This is coming with C++11 and the std::atomic<T> class. I don't know about C1x.)
As demonstrated in this answer I recently posted, I seem to be confused about the utility (or lack thereof) of volatile in multi-threaded programming contexts.
My understanding is this: any time a variable may be changed outside the flow of control of a piece of code accessing it, that variable should be declared to be volatile. Signal handlers, I/O registers, and variables modified by another thread all constitute such situations.
So, if you have a global int foo, and foo is read by one thread and set atomically by another thread (probably using an appropriate machine instruction), the reading thread sees this situation the same way it sees a variable tweaked by a signal handler or modified by an external hardware condition, and thus foo should be declared volatile (or, for multithreaded situations, accessed with a memory-fenced load, which is probably a better solution).
How and where am I wrong?
The problem with volatile in a multithreaded context is that it doesn't provide all the guarantees we need. It does have a few properties we need, but not all of them, so we can't rely on volatile alone.
However, the primitives we'd have to use for the remaining properties also provide the ones that volatile does, so it is effectively unnecessary.
For thread-safe accesses to shared data, we need a guarantee that:
the read/write actually happens (that the compiler won't just store the value in a register instead and defer updating main memory until much later)
that no reordering takes place. Assume that we use a volatile variable as a flag to indicate whether or not some data is ready to be read. In our code, we simply set the flag after preparing the data, so all looks fine. But what if the instructions are reordered so the flag is set first?
volatile does guarantee the first point. It also guarantees that no reordering occurs between different volatile reads/writes. All volatile memory accesses will occur in the order in which they're specified. That is all we need for what volatile is intended for: manipulating I/O registers or memory-mapped hardware, but it doesn't help us in multithreaded code where the volatile object is often only used to synchronize access to non-volatile data. Those accesses can still be reordered relative to the volatile ones.
The solution to preventing reordering is to use a memory barrier, which indicates both to the compiler and the CPU that no memory access may be reordered across this point. Placing such barriers around our volatile variable access ensures that even non-volatile accesses won't be reordered across the volatile one, allowing us to write thread-safe code.
However, memory barriers also ensure that all pending reads/writes are executed when the barrier is reached, so it effectively gives us everything we need by itself, making volatile unnecessary. We can just remove the volatile qualifier entirely.
Since C++11, atomic variables (std::atomic<T>) give us all of the relevant guarantees.
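As a brief sketch of the flag example from earlier, redone with C++11 atomics (the names are illustrative): the release store cannot be reordered before the data write, and the acquire load cannot be reordered after it, so the plain int needs no volatile at all.

#include <atomic>

int data = 0;                   // ordinary, non-volatile, non-atomic
std::atomic<bool> ready{false};

void producer()
{
    data = 42;                                    // 1. prepare the data
    ready.store(true, std::memory_order_release); // 2. publish; cannot move above (1)
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))
        ;                                         // spin until published
    int seen = data; // guaranteed to observe 42: the acquire load
    (void)seen;      // synchronizes with the release store
}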
You might also consider this from the Linux Kernel Documentation.
C programmers have often taken volatile to mean that the variable
could be changed outside of the current thread of execution; as a
result, they are sometimes tempted to use it in kernel code when
shared data structures are being used. In other words, they have been
known to treat volatile types as a sort of easy atomic variable, which
they are not. The use of volatile in kernel code is almost never
correct; this document describes why.
The key point to understand with regard to volatile is that its
purpose is to suppress optimization, which is almost never what one
really wants to do. In the kernel, one must protect shared data
structures against unwanted concurrent access, which is very much a
different task. The process of protecting against unwanted
concurrency will also avoid almost all optimization-related problems
in a more efficient way.
Like volatile, the kernel primitives which make concurrent access to
data safe (spinlocks, mutexes, memory barriers, etc.) are designed to
prevent unwanted optimization. If they are being used properly, there
will be no need to use volatile as well. If volatile is still
necessary, there is almost certainly a bug in the code somewhere. In
properly-written kernel code, volatile can only serve to slow things
down.
Consider a typical block of kernel code:
spin_lock(&the_lock);
do_something_on(&shared_data);
do_something_else_with(&shared_data);
spin_unlock(&the_lock);
If all the code follows the locking rules, the value of shared_data
cannot change unexpectedly while the_lock is held. Any other code
which might want to play with that data will be waiting on the lock.
The spinlock primitives act as memory barriers - they are explicitly
written to do so - meaning that data accesses will not be optimized
across them. So the compiler might think it knows what will be in
shared_data, but the spin_lock() call, since it acts as a memory
barrier, will force it to forget anything it knows. There will be no
optimization problems with accesses to that data.
If shared_data were declared volatile, the locking would still be
necessary. But the compiler would also be prevented from optimizing
access to shared_data within the critical section, when we know that
nobody else can be working with it. While the lock is held,
shared_data is not volatile. When dealing with shared data, proper
locking makes volatile unnecessary - and potentially harmful.
The volatile storage class was originally meant for memory-mapped I/O
registers. Within the kernel, register accesses, too, should be
protected by locks, but one also does not want the compiler
"optimizing" register accesses within a critical section. But, within
the kernel, I/O memory accesses are always done through accessor
functions; accessing I/O memory directly through pointers is frowned
upon and does not work on all architectures. Those accessors are
written to prevent unwanted optimization, so, once again, volatile is
unnecessary.
Another situation where one might be tempted to use volatile is when
the processor is busy-waiting on the value of a variable. The right
way to perform a busy wait is:
while (my_variable != what_i_want)
cpu_relax();
The cpu_relax() call can lower CPU power consumption or yield to a
hyperthreaded twin processor; it also happens to serve as a memory
barrier, so, once again, volatile is unnecessary. Of course,
busy-waiting is generally an anti-social act to begin with.
There are still a few rare situations where volatile makes sense in
the kernel:
The above-mentioned accessor functions might use volatile on
architectures where direct I/O memory access does work. Essentially,
each accessor call becomes a little critical section on its own and
ensures that the access happens as expected by the programmer.
Inline assembly code which changes memory, but which has no other
visible side effects, risks being deleted by GCC. Adding the volatile
keyword to asm statements will prevent this removal.
The jiffies variable is special in that it can have a different value
every time it is referenced, but it can be read without any special
locking. So jiffies can be volatile, but the addition of other
variables of this type is strongly frowned upon. Jiffies is considered
to be a "stupid legacy" issue (Linus's words) in this regard; fixing it
would be more trouble than it is worth.
Pointers to data structures in coherent memory which might be modified
by I/O devices can, sometimes, legitimately be volatile. A ring buffer
used by a network adapter, where that adapter changes pointers to
indicate which descriptors have been processed, is an example of this
type of situation.
For most code, none of the above justifications for volatile apply.
As a result, the use of volatile is likely to be seen as a bug and
will bring additional scrutiny to the code. Developers who are
tempted to use volatile should take a step back and think about what
they are truly trying to accomplish.
I don't think you're wrong -- volatile is necessary to guarantee that thread A will see the value change, if the value is changed by something other than thread A. As I understand it, volatile is basically a way to tell the compiler "don't cache this variable in a register, instead be sure to always read/write it from RAM memory on every access".
The confusion is because volatile isn't sufficient for implementing a number of things. In particular, modern systems use multiple levels of caching, modern multi-core CPUs do some fancy optimizations at run-time, and modern compilers do some fancy optimizations at compile time, and these all can result in various side effects showing up in a different order from the order you would expect if you just looked at the source code.
So volatile is fine, as long as you keep in mind that the 'observed' changes in the volatile variable may not occur at the exact time you think they will. Specifically, don't try to use volatile variables as a way to synchronize or order operations across threads, because it won't work reliably.
Personally, my main (only?) use for the volatile flag is as a "pleaseGoAwayNow" boolean. If I have a worker thread that loops continuously, I'll have it check the volatile boolean on each iteration of the loop, and exit if the boolean is ever true. The main thread can then safely clean up the worker thread by setting the boolean to true, and then calling pthread_join() to wait until the worker thread is gone.
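In post-C++11 code the same "pleaseGoAwayNow" pattern would normally use std::atomic<bool> instead of volatile; a sketch, with std::thread::join standing in for pthread_join():

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> pleaseGoAwayNow{false};

void worker()
{
    while (!pleaseGoAwayNow.load(std::memory_order_relaxed))
    {
        // ... one iteration of work ...
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

int main()
{
    std::thread t(worker);
    // ... main thread does its own work ...
    pleaseGoAwayNow.store(true); // ask the worker to exit
    t.join();                    // safe cleanup, as with pthread_join()
}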
volatile is useful (albeit insufficient) for implementing the basic construct of a spinlock mutex, but once you have that (or something superior), you don't need another volatile.
The typical way of multithreaded programming is not to protect every shared variable at the machine level, but rather to introduce guard variables which guide program flow. Instead of volatile bool my_shared_flag; you should have
pthread_mutex_t flag_guard_mutex; // contains something volatile
bool my_shared_flag;
Not only does this encapsulate the "hard part," it's fundamentally necessary: C does not include atomic operations necessary to implement a mutex; it only has volatile to make extra guarantees about ordinary operations.
Now you have something like this:
pthread_mutex_lock( &flag_guard_mutex );
my_local_state = my_shared_flag; // critical section
pthread_mutex_unlock( &flag_guard_mutex );
pthread_mutex_lock( &flag_guard_mutex ); // may alter my_shared_flag
my_shared_flag = ! my_shared_flag; // critical section
pthread_mutex_unlock( &flag_guard_mutex );
my_shared_flag does not need to be volatile, despite being uncacheable, because
Another thread has access to it.
Meaning a reference to it must have been taken sometime (with the & operator).
(Or a reference was taken to a containing structure)
pthread_mutex_lock is a library function.
Meaning the compiler can't tell if pthread_mutex_lock somehow acquires that reference.
Meaning the compiler must assume that pthread_mutex_lock modifies the shared flag!
So the variable must be reloaded from memory. volatile, while meaningful in this context, is extraneous.
Your understanding really is wrong.
The property that volatile variables have is "reads from and writes to this variable are part of the perceivable behaviour of the program". That means this program works (given appropriate hardware):
int volatile* reg=IO_MAPPED_REGISTER_ADDRESS;
*reg=1; // turn the fuel on
*reg=2; // ignition
*reg=3; // release
int x=*reg; // fire missiles
The problem is, this is not the property we want from thread-safe anything.
For example, a thread-safe counter would be just (Linux-kernel-like code; I don't know the C++0x equivalent):
atomic_t counter;
...
atomic_inc(&counter);
This is atomic, without a memory barrier. You should add barriers if necessary. Adding volatile would probably not help, because it wouldn't relate the access to the nearby code (e.g., to the appending of an element to the list the counter is counting). Certainly, you don't need to see the counter incremented outside your program, and optimizations are still desirable, e.g.:
atomic_inc(&counter);
atomic_inc(&counter);
can still be optimised to
atomically {
    counter += 2;
}
if the optimizer is smart enough (it doesn't change the semantics of the code).
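The C++11 equivalent of that kernel-style counter is std::atomic<int> with relaxed ordering (a sketch; whether a given compiler actually merges the two increments is a quality-of-implementation matter, but the as-if rule permits it):

#include <atomic>

std::atomic<int> counter{0};

void bump_twice()
{
    counter.fetch_add(1, std::memory_order_relaxed);
    counter.fetch_add(1, std::memory_order_relaxed);
    // A compiler may combine these into a single fetch_add(2): each
    // increment stays atomic, and relaxed ordering imposes no ordering
    // constraint relative to surrounding non-atomic code.
}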
For your data to be consistent in a concurrent environment you need two conditions to apply:
1) Atomicity, i.e. if I read or write some data to memory, then that data gets read/written in one pass and cannot be interrupted or contended due to, e.g., a context switch.
2) Consistency, i.e. the order of read/write ops must be seen to be the same between multiple concurrent environments, be they threads, machines, etc.
volatile fits neither of the above; more particularly, the C and C++ standards' description of how volatile should behave includes neither of the above.
It's even worse in practice, as some compilers (such as the Intel Itanium compiler) do attempt to implement some element of concurrent-access-safe behaviour (i.e. by ensuring memory fences); however, there is no consistency across compiler implementations, and moreover the standard does not require this of the implementation in the first place.
Marking a variable as volatile will just mean that you are forcing the value to be flushed to and from memory each time, which in many cases just slows down your code since you've basically blown your cache performance.
C# and Java, AFAIK, do redress this by making volatile adhere to 1) and 2); the same cannot be said for C/C++ compilers, so basically do with it as you see fit.
For some more in-depth (though not unbiased) discussion on the subject, read this.
The comp.programming.threads FAQ has a classic explanation by Dave Butenhof:
Q56: Why don't I need to declare shared variables VOLATILE?
I'm concerned, however, about cases where both the compiler and the
threads library fulfill their respective specifications. A conforming
C compiler can globally allocate some shared (nonvolatile) variable to
a register that gets saved and restored as the CPU gets passed from
thread to thread. Each thread will have its own private value for
this shared variable, which is not what we want from a shared
variable.
In some sense this is true, if the compiler knows enough about the
respective scopes of the variable and the pthread_cond_wait (or
pthread_mutex_lock) functions. In practice, most compilers will not try
to keep register copies of global data across a call to an external
function, because it's too hard to know whether the routine might
somehow have access to the address of the data.
So yes, it's true that a compiler that conforms strictly (but very
aggressively) to ANSI C might not work with multiple threads without
volatile. But someone had better fix it. Because any SYSTEM (that is,
pragmatically, a combination of kernel, libraries, and C compiler) that
does not provide the POSIX memory coherency guarantees does not CONFORM
to the POSIX standard. Period. The system CANNOT require you to use
volatile on shared variables for correct behavior, because POSIX
requires only that the POSIX synchronization functions are necessary.
So if your program breaks because you didn't use volatile, that's a BUG.
It may not be a bug in C, or a bug in the threads library, or a bug in
the kernel. But it's a SYSTEM bug, and one or more of those components
will have to work to fix it.
You don't want to use volatile, because, on any system where it makes
any difference, it will be vastly more expensive than a proper
nonvolatile variable. (ANSI C requires "sequence points" for volatile
variables at each expression, whereas POSIX requires them only at
synchronization operations -- a compute-intensive threaded application
will see substantially more memory activity using volatile, and, after
all, it's the memory activity that really slows you down.)
Mr Butenhof covers much of the same ground in this usenet post:
The use of "volatile" is not sufficient to ensure proper memory
visibility or synchronization between threads. The use of a mutex is
sufficient, and, except by resorting to various non-portable machine
code alternatives, (or more subtle implications of the POSIX memory
rules that are much more difficult to apply generally, as explained in
my previous post), a mutex is NECESSARY.
Therefore, as Bryan explained, the use of volatile accomplishes
nothing but to prevent the compiler from making useful and desirable
optimizations, providing no help whatsoever in making code "thread
safe". You're welcome, of course, to declare anything you want as
"volatile" -- it's a legal ANSI C storage attribute, after all. Just
don't expect it to solve any thread synchronization problems for you.
All that's equally applicable to C++.
This is all that "volatile" is doing:
"Hey compiler, this variable could change AT ANY MOMENT (on any clock tick) even if there are NO LOCAL INSTRUCTIONS acting on it. Do NOT cache this value in a register."
That is IT. It tells the compiler that your value is, well, volatile- this value may be altered at any moment by external logic (another thread, another process, the Kernel, etc.). It exists more or less solely to suppress compiler optimizations that will silently cache a value in a register that it is inherently unsafe to EVER cache.
You may encounter articles like "Dr. Dobbs" that pitch volatile as some panacea for multi-threaded programming. His approach isn't totally devoid of merit, but it has the fundamental flaw of making an object's users responsible for its thread-safety, which tends to have the same issues as other violations of encapsulation.
According to my old C standard, "What constitutes an access to an object that has volatile-qualified type is implementation-defined". So C compiler writers could have chosen to have "volatile" mean "thread-safe access in a multi-process environment". But they didn't.
Instead, the operations required to make a critical section thread-safe in a multi-core, multi-process, shared-memory environment were added as new implementation-defined features. And, freed from the requirement that "volatile" provide atomic access and access ordering in a multi-process environment, the compiler writers prioritized code reduction over historical implementation-dependent "volatile" semantics.
This means that things like "volatile" semaphores around critical code sections, which do not work on new hardware with new compilers, might once have worked with old compilers on old hardware, and old examples are sometimes not wrong, just old.
I'm writing a C++ app.
I have a class variable that more than one thread is writing to.
In C++, anything that can be modified without the compiler "realizing" that it's being changed needs to be marked volatile, right? So if my code is multi-threaded, and one thread may write to a var while another reads from it, do I need to mark the var volatile?
[I don't have a race condition since I'm relying on writes to ints being atomic]
Thanks!
C++ doesn't yet have any provision for multithreading. In practice, volatile doesn't do what you mean (it was designed for memory-mapped hardware, and while the two issues are similar, they are different enough that volatile doesn't do the right thing; note that volatile has been used in other languages for usage in MT contexts).
So if you want to write an object in one thread and read it in another, you'll have to use the synchronization features your implementation provides. For the ones I know of, volatile plays no role in that.
FYI, the next standard will take MT into account, and volatile will play no role in that, so that won't change. You'll just have standard-defined conditions in which synchronization is needed, and standard-defined ways of achieving it.
Yes, volatile is the absolute minimum you'll need. It ensures that the code generator won't store the variable in a register, and always performs reads from and writes to memory. Most code generators can provide atomicity guarantees on variables that have the same size as the native CPU word; they'll ensure the memory address is aligned so that the variable cannot straddle a cache-line boundary.
That is however not a very strong contract on modern multi-core CPUs. Volatile does not promise that another thread that runs on another core can see updates to the variable. That requires a memory barrier, usually an instruction that flushes the CPU cache. If you don't provide a barrier, the thread will in effect keep running until such a flush occurs naturally. That will eventually happen, the thread scheduler is bound to provide one. That can take milliseconds.
Once you've taken care of details like this, you'll eventually have re-invented a condition variable (aka event) that isn't likely to be any faster than the one provided by a threading library. Or as well tested. Don't invent your own, threading is hard enough to get right, you don't need the FUD of not being sure that the very basic primitives are solid.
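For comparison, the library-provided condition variable mentioned above takes only a few lines to use correctly; a sketch with illustrative names (note the flag is a plain bool, because the mutex protects it):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool event_fired = false; // plain bool: guarded by the mutex

void signal_event()
{
    {
        std::lock_guard<std::mutex> lock(m);
        event_fired = true;
    }
    cv.notify_one(); // wake a waiting thread, if any
}

void wait_for_event()
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, []{ return event_fired; }); // handles spurious wakeups
}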
volatile instructs the compiler not to optimize based on "intuition" about a variable's value or usage, since the variable could be modified "from the outside".
volatile won't provide any synchronization, however, and your assumption that writes to an int are atomic is anything but realistic!
I'd guess we'd need to see some usage to know whether volatile is needed in your case (or to check the behavior of your program), or more importantly whether you need some sort of synchronization.
I think that volatile only really applies to reading, especially reading memory-mapped I/O registers.
It can be used to tell the compiler to not assume that once it has read from a memory location that the value won't change:
while (*p)
{
    // ...
}
In the above code, if *p is not written to within the loop, the compiler might decide to move the read outside the loop, more like this:
cached_p = *p;
while (cached_p)
{
    // ...
}
If p is a pointer to a memory-mapped I/O port, you would want the first version where the port is checked before the loop is entered every time.
If p is a pointer to memory in a multi-threaded app, you're still not guaranteed that writes are atomic.
Without locking you may still get 'impossible' re-orderings done by the compiler or processor. And there's no guarantee that writes to ints are atomic.
It would be better to use proper locking.
Volatile will solve your problem, i.e. it will guarantee consistency among all the caches of the system. However it will be inefficient, since it goes to memory for every read and write access. You might consider using a memory barrier instead, only where it is actually needed.
If you are working with gcc/icc, have a look at the sync built-ins: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
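For example, a one-line atomic increment using those built-ins (the counter name is illustrative):
static int counter = 0 ; // shared counter: no volatile, no mutex needed

void bump()
{
    // atomic read-modify-write; the __sync built-ins also act as a full memory barrier
    __sync_fetch_and_add( &counter, 1 ) ;
}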
EDIT (mostly about pm100 comment):
I understand that my beliefs are not a reference so I found something to quote :)
The volatile keyword was devised to prevent compiler optimizations that might render code incorrect in the presence of certain asynchronous events. For example, if you declare a primitive variable as volatile, the compiler is not permitted to cache it in a register
From Dr Dobb's
More interesting:
Volatile fields are linearizable. Reading a volatile field is like acquiring a lock; the working memory is invalidated and the volatile field's current value is reread from memory. Writing a volatile field is like releasing a lock : the volatile field is immediately written back to memory.
(this is all about consistency, not about atomicity)
from The Art of multiprocessor programming, Maurice Herlihy & Nir Shavit
A lock contains memory synchronization code; if you don't lock, you must do something else, and using the volatile keyword is probably the simplest thing you can do (even though it was designed for external devices whose memory is bound into the address space, which is not the point here).
I have a multi-R/W lock class that keeps read, write, pending-read and pending-write counters. A mutex guards them from multiple threads.
My question is: do we still need the counters to be declared as volatile so that the compiler won't screw them up while doing its optimizations?
Or does the compiler take into account that the counters are guarded by the mutex?
I understand that the mutex is a run-time mechanism for synchronization, while the "volatile" keyword is a compile-time indication to the compiler to do the right thing while doing the optimizations.
Regards,
-Jay.
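For concreteness, here is a minimal sketch of the kind of class being described, with hypothetical member names; note that the counters are plain ints, which is exactly what the answers below address:
#include <pthread.h>

class RWLock
{
    pthread_mutex_t m_guard ; // protects all four counters
    int m_readers ; // plain int, only touched while holding m_guard
    int m_writers ;
    int m_pendingReaders ;
    int m_pendingWriters ;

public:
    RWLock() : m_readers(0), m_writers(0), m_pendingReaders(0), m_pendingWriters(0)
    {
        pthread_mutex_init( &m_guard, 0 ) ;
    }
    ~RWLock() { pthread_mutex_destroy( &m_guard ) ; }

    void addPendingReader()
    {
        pthread_mutex_lock( &m_guard ) ;
        m_pendingReaders++ ; // load/increment/store, all under the mutex
        pthread_mutex_unlock( &m_guard ) ;
    }
    // ... the remaining operations follow the same lock/modify/unlock pattern
} ;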
There are two basically unrelated items here that are always confused.
volatile
threads, locks, memory barriers, etc.
volatile is used to tell the compiler to produce code that reads the variable from memory, not from a register, and not to reorder the code around it. In general, not to optimize or take 'short-cuts'.
memory barriers (supplied by mutexes, locks, etc), as quoted from Herb Sutter in another answer, are for preventing the CPU from reordering read/write memory requests, regardless of the order in which the compiler emitted them. i.e. don't optimize, don't take short-cuts - at the CPU level.
Similar, but in fact very different things.
In your case, and in most cases of locking, the reason that volatile is NOT necessary is that function calls are made for the sake of locking, i.e.:
Normal function calls affecting optimizations:
extern void library_func(); // from some external library
int x; // global
int f()
{
    x = 2;
    library_func();
    return x; // x is reloaded because it may have changed
}
Unless the compiler can examine library_func() and determine that it doesn't touch x, it will re-read x for the return. This is even WITHOUT volatile.
Threading:
int f(SomeObject & obj)
{
    int temp1;
    int temp2;
    int temp3;
    temp1 = obj.x;
    lock(obj.mutex); // really should use RAII
    temp2 = obj.x;
    temp3 = obj.x;
    unlock(obj.mutex);
    return temp1 + temp2 + temp3;
}
After reading obj.x for temp1, the compiler is going to re-read obj.x for temp2 - NOT because of the magic of locks - but because it is unsure whether lock() modified obj. You could probably set compiler flags to aggressively optimize (no-alias, etc) and thus not re-read x, but then a bunch of your code would probably start failing.
For temp3, the compiler (hopefully) won't re-read obj.x.
If for some reason obj.x could change between temp2 and temp3, then you would use volatile (and your locking would be broken/useless).
Lastly, if your lock()/unlock() functions were somehow inlined, maybe the compiler could evaluate the code and see that obj.x doesn't get changed. But I guarantee one of two things here:
- the inline code eventually calls some OS-level lock function (thus preventing evaluation), or
- you call some asm memory-barrier instructions (i.e. ones wrapped in intrinsics like __InterlockedCompareExchange) that your compiler will recognize and thus avoid reordering around.
EDIT: P.S. I forgot to mention - for pthreads stuff, some compilers are marked as "POSIX compliant" which means, among other things, that they will recognize the pthread_ functions and not do bad optimizations around them. i.e. even though the C++ standard doesn't mention threads yet, those compilers do (at least minimally).
So, short answer
you don't need volatile.
From Herb Sutter's article "Use Critical Sections (Preferably Locks) to Eliminate Races" (http://www.ddj.com/cpp/201804238):
So, for a reordering transformation to be valid, it must respect the program's critical sections by obeying the one key rule of critical sections: Code can't move out of a critical section. (It's always okay for code to move in.) We enforce this golden rule by requiring symmetric one-way fence semantics for the beginning and end of any critical section, illustrated by the arrows in Figure 1:
Entering a critical section is an acquire operation, or an implicit acquire fence: Code can never cross the fence upward, that is, move from an original location after the fence to execute before the fence. Code that appears before the fence in source code order, however, can happily cross the fence downward to execute later.
Exiting a critical section is a release operation, or an implicit release fence: This is just the inverse requirement that code can't cross the fence downward, only upward. It guarantees that any other thread that sees the final release write will also see all of the writes before it.
So, for a compiler to produce correct code for a target platform, the correct acquire and release semantics must be followed when a critical section is entered and exited (and the term critical section is used here in its generic sense, not necessarily in the Win32 sense of something protected by a CRITICAL_SECTION structure - the critical section can be protected by other synchronization objects). So you should not have to mark the shared variables as volatile, as long as they are accessed only within protected critical sections.
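To make Sutter's one-way fences concrete, here is a sketch using C++11 atomics (which postdate this discussion, but express acquire/release semantics directly); the lock word and data names are invented:
#include <atomic>

int data = 0 ; // ordinary shared data, not volatile
std::atomic<bool> locked(false) ; // stand-in for a lock word

void enter() // acquire: later accesses cannot move above this fence
{
    while ( locked.exchange(true, std::memory_order_acquire) )
        ; // spin until the "lock" is ours
}

void leave() // release: earlier accesses cannot move below this fence
{
    locked.store(false, std::memory_order_release) ;
}

void increment()
{
    enter() ;
    data++ ; // cannot be hoisted above enter() or sunk below leave()
    leave() ;
}
Code before enter() may sink into the critical section and code after leave() may rise into it, but nothing between them may escape.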
volatile is used to inform the optimizer to always load the current value of the location, rather than load it into a register and assume that it won't change. This is most valuable when working with dual-ported memory locations or locations that can be updated in real time from sources external to the thread.
The mutex is a run-time OS mechanism that the compiler really doesn't know anything about - so the optimizer wouldn't take that into account. It will prevent more than one thread from accessing the counters at one time, but the values of those counters are still subject to change even while the mutex is in effect.
So, you're marking the vars volatile because they can be externally modified, and not because they're inside a mutex guard.
Keep them volatile.
While this may depend on the threading library you are using, my understanding is that any decent library will not require use of volatile.
In Pthreads, for example, use of a mutex will ensure that your data gets committed to memory correctly.
EDIT: I hereby endorse tony's answer as being better than my own.
You still need the "volatile" keyword.
The mutexes protect the counters from concurrent access. "volatile" tells the compiler to actually read the counter from memory instead of caching it in a CPU register (which would not be updated by the concurrent thread).