Is this code thread-safe? - C++

Let's say we have a thread-safe compare-and-swap function like
long CAS(long * Dest ,long Val ,long Cmp)
which atomically compares *Dest with Cmp, copies Val into *Dest if the comparison is successful, and returns the original value of *Dest.
So I would like to ask you if the code below is thread-safe.
while (true)
{
    long dummy = *DestVar;
    if (dummy == CAS(DestVar, Value, dummy))
    {
        break;
    }
}
EDIT:
The Dest and Val parameters are pointers to variables created on the heap.
InterlockedCompareExchange is an example of our CAS function.
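For reference, the same retry loop can be written with C++11's std::atomic (a sketch, not the asker's code; compare_exchange_weak plays the role of CAS):

#include <atomic>

// Same retry loop as above: keep CASing until the value we read is still
// the stored value. The net effect is identical to a plain atomic store.
void store_with_cas(std::atomic<long>& dest, long value) {
    long expected = dest.load();
    // On failure, compare_exchange_weak reloads `expected` with the
    // current value, so each retry compares against fresh data.
    while (!dest.compare_exchange_weak(expected, value)) {
        // spin until the exchange succeeds
    }
}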

Edit: An edit to the question means most of this isn't relevant. Still, I'll leave this, as all the concerns in the C# case also carry over to the C++ case, but the C++ case brings many more concerns as stated, so it's not entirely irrelevant.
Yes, but...
Assuming you mean that this CAS is atomic (which is the case with C# Interlocked.CompareExchange and with some things available to use in some C++ libraries), then it's thread-safe in and of itself.
However DestVar = Value could be thread-safe in and of itself too (it will be in C#, whether it is in C++ or not is implementation dependent).
In C# a write to an integer is guaranteed to be atomic. As such, doing DestVar = Value will not fail due to something happening in another thread. It's "thread-safe".
In C++ there are no such guarantees, but there are on some processors (in fact, let's just drop C++ for now, there's enough complexity when it comes to the stronger guarantees of C#, and C++ has all of those complexities and more when it comes to these sort of issues).
Now, the use of atomic CAS operations in themselves will always be "thread-safe", but this is not where the complexity of thread safety comes in. It's the thread-safety of combinations of operations that is important.
In your code, at each loop either the value will be atomically over-written, or it won't. In the case where it won't it'll try again and keep going until it does. It could end up spinning for a while, but it will eventually work.
And in doing so it will have exactly the same effect as simple assignment - including the possibility of messing with what's happening in another thread and causing a serious thread-related bug.
Take a look, for comparison, at the answer to Is this use of a static queue thread-safe? and the explanation of how it works. Note that in each case a CAS is either allowed to fail because its failure means another thread has done something "useful", or, when it's checked for success, more is done than just stopping the loop. It's combinations of CASs that each pay attention to the possible state caused by other operations that allow for lock-free, wait-free code that is thread-safe.
And now that we're done with that, note also that you couldn't port that directly to C++ (it depends on garbage collection to make some possible ABA scenarios of little consequence; with C++ there are situations where there could be memory leaks). It really does also matter which language you are talking about.

It's impossible to tell, for any environment. You do not define the following:
What are the memory locations of DestVar and Value? On the heap or on the stack? If they are on the stack, then it is thread safe, as there is not another thread that can access that memory location.
If DestVar and Value are on the heap, then are they reference types or value types (i.e., do they have copy-by-assignment semantics)? If the latter, then it is thread safe.
Does CAS synchronize access to itself? In other words, does it have some sort of mutual exclusion structure that allows for only one call at a time? If so, then it is thread-safe.
If any of the conditions mentioned above are untrue, then it is indeterminable whether or not this is all thread safe. With more information about the conditions mentioned above (as well as whether this is C++ or C#; yes, it does matter) an answer can be provided.

Actually, this code is kind of broken. Either you need to know how the compiler is reading *DestVar (before or after CAS), which has wildly different semantics, or you are trying to spin on *DestVar until some other thread changes it. It's certainly not the former, since that would be crazy. If it's the latter, then you should use your original code. As it stands, your revision is not thread safe, since it isn't safe at all.

Related

Is a concurrent write and read to a non-atomic variable of fundamental type, without using it, undefined behavior?

In a lock-free queue.pop(), I read a trivially_copyable variable (of integral type) after synchronization with an atomic acquire inside a loop.
Minimized pseudo code:
// somewhere else: writePosition.store(..., release)
bool pop(size_t & returnValue) {
    auto writePos = writePosition.load(acquire);
    auto oldReadPosition = readPosition.load(relaxed);
    size_t value{};
    size_t newReadPosition{};
    do {
        value = data[oldReadPosition];
        newReadPosition = oldReadPosition + 1;
    } while (!readPosition.compare_exchange(oldReadPosition, newReadPosition, relaxed));
    // here we are the owner of the value
    returnValue = value;
    return true;
}
The memory of data[oldReadPosition] can only be changed if this value was read by another thread before.
The read and write positions are ABA safe.
With a simple copy, value = data[oldReadPosition], the memory of data[oldReadPosition] will not be changed.
But a writing thread calling queue.push(...) can change data[oldReadPosition] while we are reading it, if another thread has already read oldReadPosition and advanced readPosition.
It would be a race condition if you used the value, but is it also a race condition, and thus undefined behavior, when we leave value untouched? The standard is not specific enough, or I don't understand it.
IMO, this should be possible, because it has no effect.
I would be very happy to get a qualified answer to gain deeper insights.
Thanks a lot.
Yes, it's UB in ISO C++; value = data[oldReadPosition] in the C++ abstract machine involves reading the value of that object. (Usually that means lvalue to rvalue conversion, IIRC.)
But it's mostly harmless, probably only going to be a problem on machines with hardware race detection (not normal mainstream CPUs, but possibly on C implementations like clang with threadsanitizer).
Another use-case for a non-atomic read and then checking for possible tearing is the SeqLock, where readers can prove no tearing happened by reading the same value from an atomic counter before and after the non-atomic read. It's UB in C++, even with volatile for the non-atomic data, although volatile may be helpful in making sure the compiler-generated asm is safe. (With memory barriers and current handling of atomics by existing compilers, even non-volatile makes working asm.) See Optimal way to pass a few variables between 2 threads pinning different CPUs
atomic_thread_fence is still necessary for a SeqLock to be safe, and some of the necessary ordering of atomic loads with respect to non-atomic ones may be an implementation detail, if the fence can't sync with something and create a happens-before.
People do use SeqLocks in real life, depending on the fact that real-life compilers de facto define a bit more behaviour than ISO C++. Or, another way to put it is that they happen to work for now; if you're careful about what code you put around the non-atomic read, it's unlikely for a compiler to be able to do anything problematic.
But you're definitely venturing out past the safe area of guaranteed behaviour, and probably need to understand how C++ compiles to asm, and how asm works on the target platforms you care about; see also Who's afraid of a big bad optimizing compiler? on LWN; it's aimed at Linux kernel code, which is the main user of hand-rolled atomics and stuff like that.
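To make the SeqLock pattern concrete, here is a minimal sketch along the lines described (the struct and names are mine; the non-atomic accesses remain formally UB in ISO C++, as discussed above):

#include <atomic>
#include <cstring>

// Minimal SeqLock: readers detect torn reads by re-checking a sequence
// counter. The plain accesses to `data` are a data race in ISO C++; this
// relies on the de-facto compiler behaviour discussed above.
struct SeqLock {
    std::atomic<unsigned> seq{0};
    int data[4] = {};   // non-atomic payload

    void write(const int (&src)[4]) {
        unsigned s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);          // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);  // order seq store before data stores
        std::memcpy(data, src, sizeof data);
        seq.store(s + 2, std::memory_order_release);          // even: write complete
    }

    bool try_read(int (&dst)[4]) {
        unsigned s1 = seq.load(std::memory_order_acquire);
        if (s1 & 1) return false;                             // writer active
        std::memcpy(dst, data, sizeof data);
        std::atomic_thread_fence(std::memory_order_acquire);  // order data loads before re-check
        return seq.load(std::memory_order_relaxed) == s1;     // unchanged => no tearing
    }
};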

A legitimate use case for volatile in C++?

I'm running a few threads that basically are all returning the same object as a result. Then I wait for all of them to complete, and basically read the results. To avoid needing synchronization, I figured I could just pre-allocate all the result objects in an array or vector and give the threads a pointer to each. At a high level, the code is something like this (simplified):
std::vector<Foo> results(2);
RunThread1(&results[0]);
RunThread2(&results[1]);
WaitForAll();
// Read results
cout << results[0].name << results[1].name;
Basically I'd like to know if there's anything potentially unsafe about this "code". One thing I was wondering is whether the vector should be declared volatile so that the reads at the end aren't optimized away and output an incorrect value.
The short answer to your question is no, the array should not be declared volatile. For two simple reasons:
Using volatile is not necessary. Every sane multithreading platform provides synchronization primitives with well-defined semantics. If you use them, you don't need volatile.
Using volatile isn't sufficient. Since volatile doesn't have defined multithread semantics on any platform you are likely to use, it alone is not enough to provide synchronization.
Most likely, whatever you do in WaitForAll will be sufficient. For example, if it uses an event, mutex, condition variable, or almost anything like that, it will have defined multithreading semantics that are sufficient to make this safe.
Update: "Just out curiosity, what would be an example of something that happens in the WaitForAll that guarantees safety of the read? Wouldn't it need to effectively tell the compiler somehow to "flush" the cache or avoid optimizations of subsequent read operations?"
Well, if you're using pthreads, then if it uses pthread_join that would be an example that guarantees safety of the read because the documentation says that anything the thread does is visible to a thread that joins it after pthread_join returns.
How it does it is an implementation detail. In practice, on modern systems, there are no caches to flush nor are there any optimizations of subsequent reads that are possible but need to be avoided.
Consider if somewhere deep inside WaitForAll, there's a call to pthread_join. Generally, you simply don't let the compiler see into the internals of pthread_join and thus the compiler has to assume that pthread_join might do anything that another thread could do. So keeping information that another thread might modify in a register across a call to pthread_join would be illegal because pthread_join itself might access or modify that data.
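For illustration, here is a self-contained sketch of the questioner's pattern using std::thread, where join() plays the role of WaitForAll() (the Foo type and names here are made up):

#include <iostream>
#include <string>
#include <thread>
#include <vector>

struct Foo { std::string name; };

// Each thread writes only to its own pre-allocated slot.
void work(Foo* out, const char* n) { out->name = n; }

int main() {
    std::vector<Foo> results(2);   // sized up front: no reallocation while threads run
    std::thread t1(work, &results[0], "one");
    std::thread t2(work, &results[1], "two");
    t1.join();   // join() synchronizes-with the thread's completion,
    t2.join();   // so the reads below see the writes without volatile
    std::cout << results[0].name << results[1].name << '\n';
}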
I was wondering whether the vector should be declared volatile such that the reads at the end aren't optimized and output an incorrect value.
No. If there was problem with lack of synchronisation, then volatile would not help.
But there is no problem with lack of synchronisation, since according to your description, you don't access the same object from multiple threads - until you've waited for the threads to complete, which is something that synchronises the threads.
There is a potential problem that if the objects are small (less than about 64 bytes, depending on CPU architecture), then the objects in the array share a cache line, and access to them may become effectively serialised due to write contention (false sharing). This is only a problem if the threads write to the variable a lot in relation to operations that don't access the output object.
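If that contention ever matters, one common mitigation is to pad each result to its own cache line; a sketch (64 bytes is a typical line size, and C++17 also exposes std::hardware_destructive_interference_size where implemented):

#include <string>

// Each slot occupies its own cache line, so threads writing to adjacent
// slots no longer contend for the same line. (Over-aligned types in
// standard containers need C++17's aligned allocation.)
struct alignas(64) PaddedFoo {
    std::string name;
};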
It depends on what's in WaitForAll(). If it's a proper synchronization, all is good. Mutexes, for example, or thread join, will result in the proper memory synchronization being put in.
Volatile would not help. It may prevent compiler optimizations, but would not affect anything happening at the CPU level, like caches not being updated. Use proper synchronization, like mutexes or thread join, and then the result will be valid (sequentially coherent). Don't count on volatile as a silver bullet. Compilers and CPUs are now complex enough that it won't be guaranteed.
Other answers will elaborate on the memory fences and other instructions that the synchronization will put in. :-)

Are mutex lock functions sufficient without volatile?

A coworker and I write software for a variety of platforms running on x86, x64, Itanium, PowerPC, and other 10-year-old server CPUs.
We just had a discussion about whether mutex functions such as pthread_mutex_lock() ... pthread_mutex_unlock() are sufficient by themselves, or whether the protected variable needs to be volatile.
int foo::bar()
{
    //...
    //code which may or may not access _protected.
    pthread_mutex_lock(m);
    int ret = _protected;
    pthread_mutex_unlock(m);
    return ret;
}
My concern is caching. Could the compiler place a copy of _protected on the stack or in a register, and use that stale value in the assignment? If not, what prevents that from happening? Are variations of this pattern vulnerable?
I presume that the compiler doesn't actually understand that pthread_mutex_lock() is a special function, so are we just protected by sequence points?
Thanks greatly.
Update: Alright, I can see a trend with answers explaining why volatile is bad. I respect those answers, but articles on that subject are easy to find online. What I can't find online, and the reason I'm asking this question, is how I'm protected without volatile. If the above code is correct, how is it invulnerable to caching issues?
The simplest answer is: volatile is not needed for multi-threading at all.
The long answer is that sequence points like critical sections are platform dependent, as is whatever threading solution you're using, so most of your thread safety is also platform dependent.
C++0x has a concept of threads and thread safety, but the current standard does not, and therefore volatile is sometimes misidentified as something that prevents reordering of operations and memory accesses for multi-threaded programming, when it was never intended for that and can't be reliably used that way.
The only thing volatile should be used for in C++ is to allow access to memory mapped devices, allow uses of variables between setjmp and longjmp, and to allow uses of sig_atomic_t variables in signal handlers. The keyword itself does not make a variable atomic.
Good news: in C++0x we will have the STL construct std::atomic, which can be used to guarantee atomic operations and thread-safe constructs for variables. Until your compiler of choice supports it, you may need to turn to the Boost library or bust out some assembly code to create your own objects that provide atomic variables.
P.S. A lot of the confusion is caused by Java and .NET actually enforcing multi-threaded semantics with the keyword volatile. C++, however, follows suit with C, where this is not the case.
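For illustration, once std::atomic is available, a shared counter needs neither volatile nor an explicit lock (a minimal sketch):

#include <atomic>

std::atomic<int> counter{0};

// Safe to call from any number of threads concurrently: fetch_add is an
// atomic read-modify-write, so no increments are lost.
void increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}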
Your threading library should include the appropriate CPU and compiler barriers on mutex lock and unlock. For GCC, a memory clobber on an asm statement acts as a compiler barrier.
Actually, there are two things that protect your code from (compiler) caching:
You are calling a non-pure external function (pthread_mutex_*()), which means that the compiler doesn't know that that function doesn't modify your global variables, so it has to reload them.
As I said, pthread_mutex_*() includes a compiler barrier, e.g: on glibc/x86 pthread_mutex_lock() ends up calling the macro lll_lock(), which has a memory clobber, forcing the compiler to reload variables.
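Concretely, such a compiler barrier can be written as an empty asm statement with a memory clobber (GCC/Clang syntax):

// Compiler-only barrier: emits no CPU instruction, but the "memory"
// clobber forbids the compiler from caching memory values in registers
// across this point or reordering memory accesses around it.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")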
If the above code is correct, how is it invulnerable to caching issues?
Until C++0x, it is not. And it is not specified in C. So, it really depends on the compiler. In general, if the compiler does not guarantee that it will respect ordering constraints on memory accesses for functions or operations that involve multiple threads, you will not be able to write multithreaded safe code with that compiler. See Hans J Boehm's Threads Cannot be Implemented as a Library.
As for what abstractions your compiler should support for thread safe code, the wikipedia entry on Memory Barriers is a pretty good starting point.
(As for why people suggested volatile, some compilers treat volatile as a memory barrier for the compiler. It's definitely not standard.)
The volatile keyword is a hint to the compiler that the variable might change outside of program logic, such as a memory-mapped hardware register that could change as part of an interrupt service routine. This prevents the compiler from assuming a cached value is always correct and would normally force a memory read to retrieve the value. This usage pre-dates threading by a couple decades or so. I've seen it used with variables manipulated by signals as well, but I'm not sure that usage was correct.
Variables guarded by mutexes are guaranteed to be correct when read or written by different threads. The threading API is required to ensure that such views of variables are consistent. This access is all part of your program logic and the volatile keyword is irrelevant here.
With the exception of the simplest spin lock algorithm, mutex code is quite involved: good optimized mutex lock/unlock code contains the kind of code even excellent programmers struggle to understand. It uses special compare-and-set instructions, manages not only the unlocked/locked state but also the wait queue, and optionally uses system calls to go into a wait state (for lock) or wake up other threads (for unlock).
There is no way the average compiler can decode and "understand" all that complex code (again, with the exception of the simple spin lock) no matter what, so even for a compiler not aware of what a mutex is, and how it relates to synchronization, there is no way in practice a compiler could optimize anything around such code.
That's even if the code is "inline", or available for analysis for the purpose of cross-module optimization, or if global optimization is available.
I presume that the compiler doesn't actually understand that pthread_mutex_lock() is a special function, so are we just protected by sequence points?
The compiler does not know what it does, so it does not try to optimize around it.
How is it "special"? It's opaque and treated as such. It is not special among opaque functions.
There is no semantic difference from an arbitrary opaque function that can access any other object.
My concern is caching. Could the compiler place a copy of _protected on the stack or in a register, and use that stale value in the assignment?
Yes, in code that acts on objects transparently and directly, by using the variable name or pointers in a way that the compiler can follow; not in code that might use arbitrary pointers to access variables indirectly.
So yes, between calls to opaque functions; not across them.
And also for variables which can only be used in the function, by name: local variables that don't have their address taken or a reference bound to them (doing either would mean the compiler cannot follow all further uses). These can indeed be "cached" across arbitrary calls, including lock/unlock.
If not, what prevents that from happening? Are variations of this pattern vulnerable?
Opacity of the functions. Non-inlining. Assembly code. System calls. Code complexity. Everything that makes compilers bail out and think "that's complicated stuff, just make calls to it".
The default position of a compiler is always "let's execute stupidly, I don't understand what is being done anyway", not "I will optimize that / let's rewrite the algorithm, I know better". Most code is not optimized in a complex, non-local way.
Now let's assume the absolute worst (from our point of view, which is that the compiler should give up; that is the absolute best from the point of view of an optimizing algorithm):
the function is "inline" (= available for inlining) (or global optimization kicks in, or all functions are morally "inline");
no memory barrier is needed (as in a mono-processor time sharing system, and in a multi-processor strongly ordered system) in that synchronization primitive (lock or unlock) so it contains no such thing;
there is no special instruction (like compare and set) used (for example for a spin lock, the unlock operation is a simple write);
there is no system call to pause or wake threads (not needed in a spin lock);
then we might have a problem, as the compiler could optimize around the function call. This is fixed trivially by inserting a compiler barrier, such as an empty asm statement with a "clobber" for other accessible variables. That means the compiler just assumes that anything that might be accessible to a called function is "clobbered".
or whether the protected variable needs to be volatile.
You can make it volatile for the usual reason you make things volatile: to be certain to be able to access the variable in the debugger, to prevent a floating point variable from having the wrong datatype at runtime, etc.
Making it volatile would actually not even fix the issue described above as volatile is essentially a memory operation in the abstract machine that has the semantics of an I/O operation and as such is only ordered with respect to
real I/O like iostream
system calls
other volatile operations
asm memory clobbers (but then no memory side effect is reordered around those)
calls to external functions (as they might do one of the above)
Volatile is not ordered with respect to non-volatile memory side effects. That makes volatile practically useless (useless for practical purposes) for writing thread-safe code, even in the most specific case where volatile would a priori help, the case where no memory fence is ever needed: when programming threading primitives on a time-sharing system on a single CPU. (That may be one of the least understood aspects of either C or C++.)
So while volatile does prevent "caching", it doesn't even prevent compiler reordering of lock/unlock operations unless all shared variables are volatile.
Locks/synchronisation primitives make sure the data is not cached in registers/CPU cache; that means data propagates to memory. If two threads are accessing/modifying data within locks, it is guaranteed that data is read from memory and written to memory. We don't need volatile in this use case.
But in the case where you have code with double checks, the compiler can optimise the code and remove redundant reads; to prevent that, we need volatile.
Example: see singleton pattern example
https://en.m.wikipedia.org/wiki/Singleton_pattern#Lazy_initialization
Why would someone write this kind of code?
Ans: There is a performance benefit in not acquiring the lock.
PS: This is my first post on stack overflow.
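For reference, the double-checked locking that answer alludes to cannot be fixed portably with volatile in pre-C++11 C++; with C++11 atomics it can be written roughly like this (a sketch, names mine):

#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton* get() {
        Singleton* p = instance.load(std::memory_order_acquire);
        if (!p) {                                   // first check, no lock taken
            std::lock_guard<std::mutex> lock(init_mutex);
            p = instance.load(std::memory_order_relaxed);
            if (!p) {                               // second check, under the lock
                p = new Singleton();
                instance.store(p, std::memory_order_release);
            }
        }
        return p;
    }
private:
    static std::atomic<Singleton*> instance;
    static std::mutex init_mutex;
};

std::atomic<Singleton*> Singleton::instance{nullptr};
std::mutex Singleton::init_mutex;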
Not if the object you're locking is volatile, eg: if the value it represents depends on something foreign to the program (hardware state).
volatile should NOT be used to denote any kind of behavior that is the result of executing the program.
If it's actually volatile, what I personally would do is lock on the value of the pointer/address, instead of the underlying object.
eg:
volatile int i = 0;
// ... Later, in a thread
// ... Code that may not access anything without a lock
auto ptr_to_lock = reinterpret_cast<std::uintptr_t>(&i);   // needs <cstdint>
some_lock(ptr_to_lock);
// use i
release_some_lock(ptr_to_lock);
Please note that it only works if ALL the code ever using the object in a thread locks the same address. So be mindful of that when using threads with some variable that is part of an API.

What are the threading guarantees of today's C and C++ compilers?

I'm wondering what are the guarantees that compilers make to ensure that threaded writes to memory have visible effects in other threads.
I know countless cases in which this is problematic, and I'm sure that if you're interested in answering you know it too, but please focus on the cases I'll be presenting.
More precisely, I am concerned about the circumstances that can lead to threads missing memory updates done by other threads. I don't care (at this point) if the updates are non-atomic or badly synchronized: as long as the concerned threads notice the changes, I'll be happy.
I hope that compilers make the distinction between two kinds of variable accesses:
Accesses to variables that necessarily have an address;
Accesses to variables that don't necessarily have an address.
For instance, if you take this snippet:
void sleepingbeauty()
{
    int i = 1;
    while (i) sleep(1);
}
Since i is a local, I assume that my compiler can optimize it away, and just let the sleeping beauty fall to eternal slumber.
void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}
Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.
int i = 1;

void sleepingbeauty()
{
    while (i) sleep(1);
}
Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it instead of caching the value.
void sleepingbeauty(int* ptr)
{
    *ptr = 1;
    while (*ptr) sleep(1);
}
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.
I'm fairly sure that this is the memory access model used by every C and C++ compiler in production out there, but I don't think there are any guarantees. In fact, the C++03 is even blind to the existence of threads, so this question wouldn't even make sense with the standard in mind. I'm not sure about C, though.
Is there some documentation out there that specifies if I'm right or wrong? I know these are muddy waters since these may not be on standards grounds, it seems like an important issue to me.
Besides the compiler generating reads, I'm also worried that the CPU cache could technically retain an outdated value, and that even though my compiler did its best to bring the reads and writes about, the values never synchronise between threads. Can this happen?
Accesses to variables that don't necessarily have an address.
All variables must have addresses (from the language's perspective -- compilers are allowed to avoid giving things addresses if they can, but that's not visible from inside the language). It's a side effect of everything having to be "pointerable" that everything has an address -- even an empty class typically has a size of at least a char so that a pointer can be created to it.
Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.
That depends on the content of onedaymyprincewillcome. The compiler may inline that function if it wishes and still make no memory reads.
Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it.
Yes, but it really doesn't matter if there are reads to it. These reads might simply be going to cache on your current local CPU core, not actually going all the way back to main memory. You would need something like a memory barrier for this, and no C++ compiler is going to do that for you.
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.
Nope -- not required. The function may be inlined, which would allow the compiler to completely remove these things if it so desires.
The only language feature in the standard that lets you control things like this w.r.t. threading is volatile, which simply requires that the compiler generate reads. That does not mean the value will be consistent though because of the CPU cache issue -- you need memory barriers for that.
If you need true multithreading correctness, you're going to be using some platform specific library to generate memory barriers and things like that, or you're going to need a C++0x compiler which supports std::atomic, which does make these kinds of requirements on variables explicit.
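For example, the first sleepingbeauty loop becomes well-defined with std::atomic (a C++0x/C++11 sketch of the idea, not the asker's code):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> flag{1};

void sleepingbeauty() {
    // The atomic load forces a fresh read every iteration, and the store
    // in prince() is guaranteed to become visible to it.
    while (flag.load(std::memory_order_acquire))
        std::this_thread::sleep_for(std::chrono::seconds(1));
}

void prince() {
    flag.store(0, std::memory_order_release);   // ends the loop above
}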
You assume wrong.
void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}
In this code, your compiler will load i from memory each time through the loop. Why? NOT because it thinks another thread could alter its value, but because it thinks that sleep could modify its value. It has nothing to do with whether or not i has an address or must have an address, and everything to do with the operations that this thread performs which could modify the value.
In particular, it is not guaranteed that assigning to an int is even atomic, although this happens to be true on all platforms we use these days.
Too many things go wrong if you don't use the proper synchronization primitives for your threaded programs. For example,
char *str = 0;
asynch_get_string(&str);
while (!str)
    sleep(1);
puts(str);
This could (and even will, on some platforms) sometimes print out utter garbage and crash the program. It looks safe, but because you are not using the proper synchronization primitives, the change to str could be seen by your thread before the change to the memory location it refers to, even though the other thread initializes the string before setting the pointer.
So just don't, don't, don't do this kind of stuff. And no, volatile is not a fix.
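For contrast, a correct version of that pointer hand-off with C++11 atomics looks like this (a sketch; the producer function stands in for asynch_get_string and is my own naming):

#include <atomic>
#include <cstdio>

std::atomic<const char*> str{nullptr};

// Producer: fully initialize the string, then publish the pointer with
// release so its contents happen-before any acquire load that sees it.
void producer() {
    static const char msg[] = "hello";
    str.store(msg, std::memory_order_release);
}

// Consumer: spin until the pointer appears; acquire pairs with the release.
void consumer() {
    const char* p;
    while (!(p = str.load(std::memory_order_acquire)))
        ;   // or sleep(1)
    std::puts(p);   // safe: the string contents are visible here
}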
Summary: The basic problem is that the compiler only changes what order the instructions go in, and where the load and store operations go. This is not enough to guarantee thread safety in general, because the processor is free to change the order of loads and stores, and the order of loads and stores is not preserved between processors. In order to ensure things happen in the right order, you need memory barriers. You can either write the assembly yourself or you can use a mutex / semaphore / critical section / etc, which does the right thing for you.
While the C++98 and C++03 standards do not dictate a standard memory model that must be used by compilers, C++0x does, and you can read about it here: http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf
In the end, for C++98 and C++03, it's really up to the compiler and the hardware platform. Typically there will not be any memory barrier or fence-operation issued by the compiler for normally written code unless you use a compiler intrinsic or something from your OS's standard library for synchronization. Most mutex/semaphore implementations also include a built-in memory barrier operation to prevent speculative reads and writes across the locking and unlocking operations on the mutex by the CPU, as well as prevent any re-ordering of operations across the same read or write calls by the compiler.
Finally, as Billy points out in the comments, on Intel x86 and x86_64 platforms, any single-byte read or write operation is atomic, as is a read or write of a register value to any 4-byte aligned memory location on x86 and any 4- or 8-byte aligned memory location on x86_64. On other platforms, that may not be the case and you would have to consult the platform's documentation.
The only control you have over optimisation is volatile.
Compilers make NO guarantee about concurrent threads accessing the same location at the same time. You will need some type of locking mechanism.
I can only speak for C, and since synchronization is a CPU-implemented functionality, a C programmer would need to call a library function for the OS containing access to the lock (the CriticalSection functions in the Windows NT engine) or implement something simpler (such as a spinlock) and access the functionality himself.
volatile is a good property to use at the module level. Sometimes a non-static (public) variable will work too.
local (stack) variables will not be accessible from other threads and should not be.
variables at the module level are good candidates for access by multiple threads but will require synchronization functions to work predictably.
Locks are unavoidable but they can be used more or less wisely resulting in a negligible or considerable performance penalty.
I answered a similar question here concerning unsynchronized threads but I think you'll be better off browsing on similar topics to get high-quality answers.
I'm not sure you understand the basics of the topic you claim to be discussing. Two threads, each starting at exactly the same time and looping one million times each performing an inc on the same variable will NOT result in a final value of two million (two * one million increments). The value will end up somewhere in-between one and two million.
The first increment will cause the value to be read from RAM into the L1 (via first the L3 then the L2) cache of the accessing thread/core. The increment is performed and the new value written initially to L1 for propagation to lower caches. When it reaches L3 (the highest cache common to both cores) the memory location will be invalidated in the other core's caches. This may seem safe, but in the meantime the other core has simultaneously performed an increment based on the same initial value in the variable. The invalidation from the write by the first core will be superseded by the write from the second core invalidating the data in the caches of the first core.
Sounds like a mess? It is! The cores are so fast that what happens in the caches falls way behind: the cores are where the action is. This is why you need explicit locks: to make sure that the new value winds up low enough in the memory hierarchy such that other cores will read the new value and nothing else. Or put another way: slow things down so the caches can catch up with the cores.
A compiler does not "feel." A compiler is rule-based and, if constructed correctly, will optimize to the extent that the rules allow and the compiler writers are able to construct the optimizer. If a variable is volatile and the code is multi-threaded the rules won't allow the compiler to skip a read. Simple as that even though on the face of it it may appear devilishly tricky.
I'll have to repeat myself and say that locks cannot be implemented in a compiler because they are specific to the OS. The generated code will call all functions without knowing if they are empty, contain lock code or will trigger a nuclear explosion. In the same way the code will not be aware of a lock being in progress since the core will insert wait states until the lock request has resulted in the lock being in place. The lock is something that exists in the core and in the mind of the programmer. The code shouldn't (and doesn't!) care.
I'm writing this answer because most of the help came from comments to questions, and not always from the authors of the answers. I already upvoted the answers that helped me most, and I'm making this a community wiki to not abuse the knowledge of others. (If you want to upvote this answer, consider also upvoting Billy's and Dietrich's answers too: they were the most helpful authors to me.)
There are two problems to address when values written from a thread need to be visible from another thread:
Caching (a value written from a CPU could never make it to another CPU);
Optimizations (a compiler could optimize away the reads to a variable if it feels it can't be changed).
The first one is rather easy. On modern Intel processors, there is a concept of cache coherence, which means changes to a cache propagate to other CPU caches.
Turns out the optimization part isn't too hard either. As soon as the compiler cannot guarantee that a function call cannot change the content of a variable, even in a single-threaded model, it won't optimize the reads away. In my examples, the compiler doesn't know that sleep cannot change i, and this is why reads are issued at every operation. It doesn't need to be sleep though, any function for which the compiler doesn't have the implementation details would do. I suppose that a particularly well-suited function to use would be one that emits a memory barrier.
In the future, it's possible that compilers will have better knowledge of currently impenetrable functions. However, when that time comes, I expect that there will be standard ways to ensure that changes are propagated correctly. (This is coming with C++11 and the std::atomic<T> class. I don't know for C1x.)

c++ volatile multithreading variables

I'm writing a C++ app.
I have a class variable that more than one thread is writing to.
In C++, anything that can be modified without the compiler "realizing" that it's being changed needs to be marked volatile, right? So if my code is multi-threaded, and one thread may write to a var while another reads from it, do I need to mark the var volatile?
[I don't have a race condition since I'm relying on writes to ints being atomic]
Thanks!
C++ doesn't yet have any provision for multithreading. In practice, volatile doesn't do what you mean (it was designed for memory-mapped hardware, and while the two issues are similar, they are different enough that volatile doesn't do the right thing -- note that volatile has been used in other languages for usage in MT contexts).
So if you want to write an object in one thread and read it in another, you'll have to use the synchronization features your implementation provides, where it needs them. For the ones I know of, volatile plays no role in that.
FYI, the next standard will take MT into account, and volatile will play no role in that. So that won't change. You'll just have standard-defined conditions in which synchronization is needed and standard-defined ways of achieving them.
Yes, volatile is the absolute minimum you'll need. It ensures that the code generator won't generate code that stores the variable in a register, and always performs reads and writes from/to memory. Most code generators can provide atomicity guarantees on variables that have the same size as the native CPU word; they'll ensure the memory address is aligned so that the variable cannot straddle a cache-line boundary.
That is, however, not a very strong contract on modern multi-core CPUs. Volatile does not promise that another thread that runs on another core can see updates to the variable. That requires a memory barrier, usually an instruction that flushes the CPU cache. If you don't provide a barrier, the thread will in effect keep running until such a flush occurs naturally. That will eventually happen; the thread scheduler is bound to provide one. That can take milliseconds.
Once you've taken care of details like this, you'll eventually have re-invented a condition variable (aka event) that isn't likely to be any faster than the one provided by a threading library. Or as well tested. Don't invent your own, threading is hard enough to get right, you don't need the FUD of not being sure that the very basic primitives are solid.
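For comparison, the kind of event the answer recommends is only a few lines with C++11 primitives (a sketch):

#include <condition_variable>
#include <mutex>

// A minimal one-shot event: wait() blocks until set() has been called.
class Event {
    std::mutex m;
    std::condition_variable cv;
    bool signalled = false;
public:
    void set() {
        { std::lock_guard<std::mutex> lk(m); signalled = true; }
        cv.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return signalled; });
    }
};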
volatile instructs the compiler not to optimize based on "intuition" about a variable's value or usage, since it could be modified "from the outside".
volatile won't provide any synchronization, however, and your assumption that writes to int are atomic is anything but realistic!
I'd guess we'd need to see some usage to know if volatile is needed in your case (or check the behavior of your program) or more importantly if you see some sort of synchronization.
I think that volatile only really applies to reading, especially reading memory-mapped I/O registers.
It can be used to tell the compiler to not assume that once it has read from a memory location that the value won't change:
while (*p)
{
    // ...
}
In the above code, if *p is not written to within the loop, the compiler might decide to move the read outside the loop, more like this:
cached_p = *p;
while (cached_p)
{
    // ...
}
If p is a pointer to a memory-mapped I/O port, you would want the first version where the port is checked before the loop is entered every time.
If p is a pointer to memory in a multi-threaded app, you're still not guaranteed that writes are atomic.
Without locking you may still get 'impossible' re-orderings done by the compiler or processor. And there's no guarantee that writes to ints are atomic.
It would be better to use proper locking.
Volatile will solve your problem, i.e. it will guarantee consistency among all the caches of the system. However, it will be inefficient since it will update the variable in memory for each R or W access. You might consider using a memory barrier, only where it is needed, instead.
If you are working with gcc/icc, have a look at the __sync built-ins: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
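For example (the builtins exist in GCC/ICC with these signatures; the wrapper names are mine):

// GCC/ICC __sync builtins: atomic read-modify-write with a full barrier.
long counter = 0;

void add_one() {
    __sync_fetch_and_add(&counter, 1);   // atomic increment
}

// Same shape as the CAS discussed earlier: returns the previous value.
long cas(long* dest, long val, long cmp) {
    return __sync_val_compare_and_swap(dest, cmp, val);
}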
EDIT (mostly about pm100 comment):
I understand that my beliefs are not a reference so I found something to quote :)
The volatile keyword was devised to prevent compiler optimizations that might render code incorrect in the presence of certain asynchronous events. For example, if you declare a primitive variable as volatile, the compiler is not permitted to cache it in a register
From Dr Dobb's
More interesting:
Volatile fields are linearizable. Reading a volatile field is like acquiring a lock; the working memory is invalidated and the volatile field's current value is reread from memory. Writing a volatile field is like releasing a lock: the volatile field is immediately written back to memory.
(this is all about consistency, not about atomicity)
from The Art of multiprocessor programming, Maurice Herlihy & Nir Shavit
A lock contains memory synchronization code; if you don't lock, you must do something, and using the volatile keyword is probably the simplest thing you can do (even if it was designed for external devices with memory bound to the address space; that's not the point here).