Are volatile reads and writes atomic on Windows+VisualC?

Are volatile reads and writes atomic on Windows+VisualC? - c++

There are a couple of questions on this site asking whether using a volatile variable for atomic / multithreaded access is possible: See here, here, or here for example.
Now, the C(++) standard conformant answer is obviously no.
However, on Windows & Visual C++ compiler, the situation seems not so clear.
I have recently answered and cited the official MSDN docs on volatile
Microsoft Specific
Objects declared as volatile are (...)
A write to a volatile object (volatile write) has Release semantics;
a reference to a global or static object? that occurs before a write to
a volatile object in the instruction sequence will occur before that
volatile write in the compiled binary.
A read of a volatile object (volatile read) has Acquire semantics; a reference to a
global or static object? that occurs after a read of volatile memory in the
instruction
sequence will occur after that volatile read in the compiled binary.
This allows volatile objects to be used for memory locks and releases in multithreaded applications.
[emphasis mine]
Now, reading this, it would appear to me that a volatile variable will be treated by the MS compiler as std::atomic would be in the upcoming C++11 standard.
However, in a comment to my answer, user Hans Passant wrote "That MSDN article is very unfortunate, it is dead wrong. You can't implement a lock with volatile, not even with Microsoft's version. (...)"
Please note: The example given in the MSDN seems pretty fishy, as you cannot generally implement a lock without atomic exchange. (As also pointed out by Alex.) This still leaves the question wrt. to the validity of the other infos given in this MSDN article, especially for use cases like here and here.)
Additionally, there are the docs for The Interlocked* functions, especially InterlockedExchange with takes a volatile(!?) variable and does an atomic read+write. (Note that one question we have on SO -- When should InterlockedExchange be used? -- does not authoritatively answer whether this function is needed for a read-only or write-only atomic access.)
What's more, the volatile docs quoted above somehow allude to "global or static object", where I would have thought that "real" acquire/release semantics should apply to all values.
Back to the question
On Windows, with Visual C++ (2005 - 2010), will declaring a (32bit? int?) variable as volatile allow for atomic reads and writes to this variable -- or not?
What is especially important to me is that this should hold (or not) on Windows/VC++ independently of the processor or platform the program runs on. (That is, does it matter whether it's a WinXP/32bit or a Windows 2008R2/64bit running on Itanum2?)
Please back up your answer with verifiable information, links, test-cases!

Yes they are atomic on windows/vc++ (Assuming you meet alignment requirements etc or course)
However for a lock you would need an atomic test and set, or compare and exchange instuction or similar, not just an atomic update or read.
Otherwise there is no way to test the lock and claim it in one indivisable operation.
EDIT: As commented below, all aligned memory accesses on x86 of 32bit or below are atomic anyway. The key point is that volatile makes the memory accesses ordered. (Thanks for pointing this out in the comments)

As of Visual C++ 2005 volatile variables are atomic. But this only applies to this specific class of compilers and to x86/AMD64 platforms. PowerPC for example may reorder memory reads/writes and would require read/write barriers. I'm not familar what the semantics are for gcc-class compilers, but in any case using volatile for atomics is not very portable.
reference, see first remark "Microsoft Specific": http://msdn.microsoft.com/en-us/library/12a04hfd%28VS.80%29.aspx

A bit off-topic, but let's have a go anyway.
... there are the docs for The Interlocked* functions, especially InterlockedExchange which takes a volatile(!) variable ...
If you think about this:
void foo(int volatile*);
Does it say:
the argument must be a pointer to a volatile int, or
the argument may as well be a pointer to a volatile int?
The latter is the correct answer, since the function can be passed both pointers to volatile and non-volatile int's.
Hence, the fact that InterlockedExchangeX() has its argument volatile-qualified does not imply that it must operate on volatile integers only.

The point is probably to allow stuff like
singleton& get_instance()
{
static volatile singleton* instance;
static mutex instance_mutex;
if (!instance)
{
raii_lock lock(instance_mutex);
if (!instance) instance = new singleton;
}
return *instance;
}
which would break if instance was written to before initialization was complete. With MSVC semantics, you are guaranteed that as soon as you see instance != 0, the object has finished being initialized (which is not the case without proper barrier semantics, even with traditional volatile semantics).
This double-checked lock (anti-)pattern is quite common actually, and broken if you don't provide barrier semantics. However, if there are guarantees that accesses to volatile variables are acquire + release barriers, then it works.
Don't rely on such custom semantics of volatile though. I suspect this has been introduced not to break existing codebases. In any way, don't write locks according to MSDN example. It probably doesn't work (I doubt you can write a lock using just a barrier: you need atomic operations -- CAS, TAS, etc -- for that).
The only portable way to write the double-checked lock pattern is to use C++0x, which provides a suitable memory model, and explicit barriers.

under x86, these operations are guaranteed to be atomic without the need for LOCK based instructions such as Interlocked* (see intel's developer manuals 3A section 8.1):
basic memory operations will always be carried out atomically:
•
Reading or writing a byte
• Reading or writing a word aligned on a
16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees
that the following additional memory operations will always be carried
out atomically:
• Reading or writing a quadword aligned on a 64-bit
boundary
• 16-bit accesses to uncached memory locations that fit
within a 32-bit data bus
The P6 family processors (and newer
processors since) guarantee that the following additional memory
operation will always be carried out atomically:
• Unaligned 16-, 32-,
and 64-bit accesses to cached memory that fit within a cache line
This means volatile will only every serve to prevent caching and instruction reordering by the compiler (MSVC won't emit atomic operations for volatile variables, they need to be explicitly used).

Related

Is memory barrier or atomic operation required in a busy-wait loop?

Consider the following spin_lock() implementation, originally from this answer:
void spin_lock(volatile bool* lock) {
for (;;) {
// inserts an acquire memory barrier and a compiler barrier
if (!__atomic_test_and_set(lock, __ATOMIC_ACQUIRE))
return;
while (*lock) // no barriers; is it OK?
cpu_relax();
}
}
What I already know:
volatile prevents compiler from optimizing out *lock re-read on each iteration of the while loop;
volatile inserts neither memory nor compiler barriers;
such an implementation actually works in GCC for x86 (e.g. in Linux kernel) and some other architectures;
at least one memory and compiler barrier is required in spin_lock() implementation for a generic architecture; this example inserts them in __atomic_test_and_set().
Questions:
Is volatile enough here or are there any architectures or compilers where memory or compiler barrier or atomic operation is required in the while loop?
1.1 According to C++ standards?
1.2 In practice, for known architectures and compilers, specifically for GCC and platforms it supports?
Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)
Is the while loop safe according to C++11 and its memory model?
There are several related questions, but I was unable to construct an explicit and unambiguous answer from them:
Q: Memory barrier in a single thread
In principle: Yes, if program execution moves from one core to the next, it might not see all writes that occurred on the previous core.
Q: memory barrier and cache flush
On pretty much all modern architectures, caches (like the L1 and L2 caches) are ensured coherent by hardware. There is no need to flush any cache to make memory visible to other CPUs.
Q: Is my spin lock implementation correct and optimal?
Q: Do spin locks always require a memory barrier? Is spinning on a memory barrier expensive?
Q: Do you expect that future CPU generations are not cache coherent?

This important: in C++ volatile has nothing at all to do with concurrency! The purpose of volatile is to tell the compiler that it shall not optimize accesses to the affected object. It does not tell the CPU anything, primarily because the CPU would know already whether memory would be volatile or not. The purpose of volatile is effectively to deal with memory mapped I/O.
The C++ standard is very clear in section 1.10 [intro.multithread] that unsynchronized access to an object which is modified in one thread and is accessed (modified or read) in another thread is undefined behavior. The synchronization primitives avoiding undefined behavior are library components like the atomic classes or mutexes. This clause mentions volatile only in the context of signals (i.e., as volatile sigatomic_t) and in the context of forward progress (i.e., that a thread will eventually do something which has an observable effect like accessing a volatile object or doing I/O). There is no mention of volatile in conjunction with synchronization.
Thus, unsynchronized assess to a variable shared across threads leads to undefined behavior. Whether it is declared volatile or not doesn't matter to this undefined behavior.

From the Wikipedia page on memory barriers:
... Other architectures, such as the Itanium, provide separate "acquire" and "release" memory barriers which address the visibility of read-after-write operations from the point of view of a reader (sink) or writer (source) respectively.
To me this implies that Itanium requires a suitable fence to make reads/writes visible to other processors, but this may in fact just be for purposes of ordering. The question, I think, really boils down to:
Does there exist an architecture where a processor might never update its local cache if not instructed to do so? I don't know the answer, but if you posed the question in this form then someone else might. In such an architecture your code potentially goes into an infinite loop where the read of *lock always sees the same value.
In terms of general C++ legality, the one atomic test and set in your example isn't enough, since it implements only a single fence which will allow you to see the initial state of the *lock when entering the while loop but not to see when it changes (which results in undefined behavior, since you are reading a variable that is changed in another thread without synchronisation) - so the answer to your question (1.1/3) is no.
On the other hand, in practice, the answer to (1.2/2) is yes (given GCC's volatile semantics), so long as the architecture guarantees cache coherence without explicit memory fences, which is true of x86 and probably for many architectures but I can't give a definite answer on whether it is true for all architectures that GCC supports. It is however generally unwise to knowingly rely on particular behavior of code that is technically undefined behavior according to the language spec, especially if it is possible to get the same result without doing so.
Incidentally, given that memory_order_relaxed exists, there seems little reason not to use it in this case rather than try to hand-optimise by using non-atomic reads, i.e. changing the while loop in your example to:
while (atomic_load_explicit(lock, memory_order_relaxed)) {
cpu_relax();
}
On x86_64 for instance the atomic load becomes a regular mov instruction and the optimised assembly output is essentially the same as it would be for your original example.

Is volatile enough here or are there any architectures or compilers where memory or compiler barrier or atomic operation is required in the while loop?
will the volatile code see the change. Yes, but not necessarily as quickly as if there was a memory barrier. At some point, some form of synchronization will occur, and the new state will be read from the variable, but there are no guarantees on how much has happened elsewhere in the code.
1.1 According to C++ standards?
From cppreference : memory_order
It is the memory model and memory order which defines the generalized hardware that the code needs to work on. For a message to pass between threads of execution, an inter-thread-happens-before relationship needs to occur. This requires either...
A synchronizes-with B
A has a std::atomic operation before B
A indirectly synchronizes with B (through X).
A is sequenced before X which inter-thread happens before B
A interthread happens before X and X interthread happens before B.
As you are not performing any of those cases there will be forms of your program where on some current hardware, it may fail.
In practice, the end of a time-slice will cause the memory to become coherent, or any form of barrier on the non-spinlock thread will ensure that the caches are flushed.
Not sure on the causes of the volatile read getting the "current value".
1.2 In practice, for known architectures and compilers, specifically for GCC and platforms it supports?
As the code is not consistent with the generalized CPU, from C++11 then it is likely this code will fail to perform with versions of C++ which try to adhere to the standard.
From cppreference : const volatile qualifiers
Volatile access stops optimizations from moving work from before it to after it, and from after it to before it.
"This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution"
So an implementation has to ensure that instructions are read from the memory location rather than any local copy. But it does not have to ensure that the volatile write is flushed through the caches to produce a coherent view across all the CPUs. In this sense, there is no time boundary on how long after a write into a volatile variable will become visible to another thread.
Also see kernel.org why volatile is nearly always wrong in kernel
Is this implementation safe on all architectures supported by GCC and Linux? (It is at least inefficient on some architectures, right?)
There is no guarantee the volatile message gets out of the thread which sets it. So not really safe. On linux it may be safe.
Is the while loop safe according to C++11 and its memory model?
No - as it doesn't create any of the inter-thread messaging primitives.

`volatile` to sync variable between threads

I have a variable int foo that is accessed from two threads. Assuming I have no race-condition issues (access is protected by a mutex, all operations are atomic, or whatever other method to protect from race conditions), there is still the issue of "register caching" (for lack of a better name), where the compiler may assume that if the variable is read twice without being written in between, it is the same value, and so may "optimize" away things like:
while(foo) { // <-may be optimized to if(foo) while(1)
do-something-that-doesn't-involve-foo;
}
or
if(foo) // becomes something like (my assembly is very rusty): mov ebx, [foo]; cmp ebx, 0; jz label;
do-something-that-doesn't-involve-foo;
do-something-else-that-doesn't-involve-foo;
if(foo) // <-may be optimized to jz label2;
do-something;
does marking foo as volatile solve this issue? Are changes from one thread guaranteed to reach the other thread?
If not, what other way is there to do this? I need a solution for Linux/Windows (possibly separate solutions), no C++11.

What you need are memory barriers.
MemoryBarrier();
or
__sync_synchronize();
Edit: I've bolded the interesting part and here's the link to the wiki article (http://en.wikipedia.org/wiki/Memory_barrier#cite_note-1) and the associated reference (http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf)
Here's the answer to your other question (from wikipedia):
In C and C++, the volatile keyword was intended to allow C and C++ programs to directly access memory-mapped I/O. Memory-mapped I/O generally requires that the reads and writes specified in source code happen in the exact order specified with no omissions. Omissions or reorderings of reads and writes by the compiler would break the communication between the program and the device accessed by memory-mapped I/O. A C or C++ compiler may not reorder reads and writes to volatile memory locations, nor may it omit a read or write to a volatile memory location. The keyword volatile does not guarantee a memory barrier to enforce cache-consistency. Therefore the use of "volatile" alone is not sufficient to use a variable for inter-thread communication on all systems and processors[1]
Check this one out, it provides great explanations on the subject:
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-2-of-2

If access is protected by a mutex, you do not have any issue to worry about. The volatile keyword is useless here. A mutex is a full memory barrier and thus no object whose address could be externally visible can be cached across the mutex lock or unlock calls.

The volatile keyword was originally introduced to indicate that the value can be changed by the hardware. This happens when hardware device has memory mapped registers or memory buffers. The right approach is to use it only for this purpose.
All modern synchronization language constructs and synchronization libraries are not using this keyword. Application level programmers should do the same thing.

Can compiler sometimes cache variable declared as volatile

From what I know, the compiler never optimizes a variable that is declared as volatile. However, I have an array declared like this.
volatile long array[8];
And different threads read and write to it. An element of the array is only modified by one of the threads and read by any other thread. However, in certain situations I've noticed that even if I modify an element from a thread, the thread reading it does not notice the change. It keeps on reading the same old value, as if compiler has cached it somewhere. But compiler in principal should not cache a volatile variable, right? So how come this is happening.
NOTE: I am not using volatile for thread synchronization, so please stop giving me answers such as use a lock or an atomic variable. I know the difference between volatile, atomic variables and mutexes. Also note that the architecture is x86 which has proactive cache coherence. Also I read the variable for long enough after it is supposedly modified by the other thread. Even after a long time, the reading thread can't see the modified value.

But compiler in principal should not cache a volatile variable, right?
No, the compiler in principle must read/write the address of the variable each time you read/write the variable.
[Edit: At least, it must do so up to the point at which the the implementation believes that the value at that address is "observable". As Dietmar points out in his answer, an implementation might declare that normal memory "cannot be observed". This would come as a surprise to people using debuggers, mprotect, or other stuff outside the scope of the standard, but it could conform in principle.]
In C++03, which does not consider threads at all, it is up to the implementation to define what "accessing the address" means when running in a thread. Details like this are called the "memory model". Pthreads, for example, allows per-thread caching of the whole of memory, including volatile variables. IIRC, MSVC provides a guarantee that volatile variables of suitable size are atomic, and it will avoid caching (rather, it will flush as far as a single coherent cache for all cores). The reason it provides that guarantee is because it's reasonably cheap to do so on Intel -- Windows only really cares about Intel-based architectures, whereas Posix concerns itself with more exotic stuff.
C++11 defines a memory model for threading, and it says that this is a data race (i.e. that volatile does not ensure that a read in one thread is sequenced relative to a write in another thread). Two accesses can be sequenced in a particular order, sequenced in unspecified order (the standard might say "indeterminate order", I can't remember), or not sequenced at all. Not sequenced at all is bad -- if either of two unsequenced accesses is a write then behavior is undefined.
The key here is the implied "and then" in "I modify an element from a thread AND THEN the thread reading it does not notice the change". You're assuming that the operations are sequenced, but they're not. As far as the reading thread is concerned, unless you use some kind of synchronization the write in the other thread hasn't necessarily happened yet. And actually it's worse than that -- you might think from what I just wrote that it's only the order of operations that is unspecified, but actually the behavior of a program with a data race is undefined.

C
What volatile does:
Guarantees an up-to-date value in the variable, if the variable is modified from an external source (a hardware register, an interrupt, a different thread, a callback function etc).
Blocks all optimizations of read/write access to the variable.
Prevent dangerous optimization bugs that can happen to variables shared between several threads/interrupts/callback functions, when the compiler does not realize that the thread/interrupt/callback is called by the program. (This is particularly common among various questionable embedded system compilers, and when you get this bug it is very hard to track down.)
What volatile does not:
It does not guarantee atomic access or any form of thread-safety.
It cannot be used instead of a mutex/semaphore/guard/critical section. It cannot be used for thread synchronization.
What volatile may or may not do:
It may or may not be implemented by the compiler to provide a memory barrier, to protect against instruction cache/instruction pipe/instruction re-ordering issues in a multi-core environment. You should never assume that volatile does this for you, unless the compiler documentation explicitly states that it does.

With volatile you can only impose that a variable is re-read whenever you use its value. It doesn't guarantee that the different values/representations that are present on different levels of your architecture are consistent.
To have such gurantees you'd need the new utilities from C11 and C++1 concerning atomic access and memory barriers. Many compilers implement these already in terms of extension. E.g the gcc family (clang, icc, etc) have builtins starting with prefix __sync to implement these.

Volatile Keyword only guarantees that the compiler will not use register for this variable. Thus every access to this variable will go and read the memory location. Now, I assume that you have cache coherence among the multiple processors in your architecture. So if one processor writes and other reads it, then it should be visible under normal conditions. However, you should consider the corner cases. Suppose the variable is in the pipeline of one processor core and other processor is trying to read it assuming that has been written, then there is a problem. So essentially, the shared variables should be either guarded by locks or should be protected by using barrier mechanism correctly.

The semantics of volatile are implementation-defined. If a compiler knew that interrupts would be disabled while certain piece of code was executed, and knew that on the target platform there would be no means other than interrupt handlers via which operations on certain storage would be observable, it could register-cache volatile-qualified variables within such storage just the same as it could cache ordinary variables, provided it documented such behavior.
Note that what aspects of behavior are counted as "observable" may be defined in some measure by the implementation. If an implementation documents that it is not intended for use on hardware which uses main RAM accesses to trigger required externally-visible actions, then accesses to main RAM would not be "observable" on that implementation. The implementation would be compatible with hardware which was capable of physically observing such accesses, if nothing cared whether any such accesses were actually seen. If such accesses were required, however, as they would be if the accesses were regarded as "observable", however, the compiler would not be claiming compatibility and would thus make no promise about anything.

For C++:
From what I know, the compiler never optimizes a variable that is declared as volatile.
Your premise is wrong. volatile is a hint to the compiler and doesn't actually guarantee anything. Compilers can choose to prevent some optimizations on volatile variables, but that's it.
volatile isn't a lock, don't try to use it as such.
7.1.5.1
7) [ Note: volatile is a hint to the implementation to avoid
aggressive optimization involving the object because the value of the
object might be changed by means undetectable by an implementation.
See 1.9 for detailed semantics. In general, the semantics of volatile
are intended to be the same in C++ as they are in C. —end note]

Volatile only affects the variable it is in front of. Here in your example, a pointer. Your code: volatile long array[8], the pointer to the first element of the array is volatile, not it's content. (same for objects of any kind)
you could adapt it like in
How do I declare an array created using malloc to be volatile in c++

The volatile keyword has nothing to do with concurrency in C++ at all! It is used to have the compiler prevented from making use of the previous value, i.e., the compiler will generate code accessing a volatile value every time is accessed in the code. The main purpose are things like memory mapped I/O. However, use of volatile has no affect on what the CPU does when reading normal memory: If the CPU has no reason to believe that the value changed in memory, e.g., because there is no synchronization directive, it can just use the value from its cache. To communicate between threads you need some synchronization, e.g., an std::atomic<T>, lock a std::mutex, etc.

C++ accesses by volatile lvalues and C accesses to volatile objects are "abstractly" "observable"--although in practice C behaviour is per the C++ standard not the C standard. Informally, the volatile declaration tells every thread the value might change somehow regardless of the text in any thread. Under the Standards with threads, there isn't any notion of a write by another thread causing a change in an object, volatile or not, shared or not, except for a shared variable by the synchronizing function call at the start of a synchronized critical region. volatile is irrelevant to thread shared objects.
If your code doesn't properly synchronize the thread you are talking about, your one thread reading what an other thread wrote has undefined behaviour. So the compiler can generate any code it wants. If your code is properly synchronized then writes by other threads only happen at thread synchronization calls; you don't need volatile for that.
PS
The standards say "What constitutes an access to an object that
has volatile-qualified type is implementation-defined." So you can't just assume that there is a read access for every dereferencing of a volatile lvalue or a write acces for every assignment through one.
Moreover how ("abstract") "observable" volatile accesses are "actually" manifested is implementation defined. So a compiler might not generate code for hardware accesses corresponding to the defined abstract accesses. Eg maybe only objects with static storage duration and external linkage compiled with a certain flag for linking into special hardware locations can ever be changed from outside the program text, so that other objects' volatile is ignored.

However, in certain situations I've noticed that even if I modify an
element from a thread, the thread reading it does not notice the
change. It keeps on reading the same old value, as if compiler has
cached it somewhere.
This is not because the compiler cached it somewhere, but because the reading thread reads from its CPU core's cache, which might be different from the writing thread's one. To ensure value change propagation across CPU cores, you need to use proper memory fences, and you neither can nor need to use volatile for that in C++.

volatile variable and atomic operations on Visual C++ x86

Plain load has acquire semantics on x86, plain store has release semantics, however compiler still can reorder instructions. While fences and locked instructions (locked xchg, locked cmpxchg) prevent both hardware and compiler from reordering, plain loads and stores are still necessary to protect with compiler barriers. Visual C++ provides _ReadWriterBarrier() function, which prevents compiler from reordering, also C++ provides volatile keyword for the same reason. I write all this information just to make sure that I get everything right. So all written above is true, is there any reason to mark as volatile variables which are going to be used in functions protected with _ReadWriteBarrier()?
For example:
int load(int& var)
{
_ReadWriteBarrier();
T value = var;
_ReadWriteBarrier();
return value;
}
Is it safe to make that variable non-volatile? As far as I understand it is, because function is protected and no reordering could be done by compiler inside. On the other hand Visual C++ provides special behavior for volatile variables (different from the one that standard does), it makes volatile reads and writes atomic loads and stores, but my target is x86 and plain loads and stores are supposed to be atomic on x86 anyway, right?
Thanks in advance.

Volatile keyword is available in C too. "volatile" is often used in embedded System, especially when value of the variable may change at any time-without any action being taken by the code - three common scenarios include reading from a memory-mapped peripheral register or global variables either modified by an interrupt service routine or those within a multi-threaded program.
So it is the last scenario where volatile could be considered to be similar to _ReadWriteBarrier.
_ReadWriteBarrier is not a function - _ReadWriteBarrier does not insert any additional instructions, and it does not prevent the CPU from rearranging reads and writes— it only prevents the compiler from rearranging them. _ReadWriteBarrier is to prevent compiler reordering.
MemoryBarrier is to prevent CPU reordering!
A compiler typically rearranges instructions... C++ does not contain built-in support for multithreaded programs so the compiler assumes the code is single-threaded when reordering the code. With MSVC use _ReadWriteBarrier in the code, so that the compiler will not move reads and writes across it.
Check this link for more detailed discussion on those topics
http://msdn.microsoft.com/en-us/library/ee418650(v=vs.85).aspx
Regarding your code snippet - you do not have to use ReadWriteBarrier as a SYNC primitive - the first call to _ReadWriteBarrier is not necessary.
When using ReadWriteBarrier you do not have to use volatile
You wrote "it makes volatile reads and writes atomic loads and stores" - I don't think that is OK to say that, Atomicity and volatility are different. Atomic operations are considered to be indivisible - ... http://www.yoda.arachsys.com/csharp/threads/volatility.shtml

Note: I am not an expert on this topic, some of my statements are "what I heard on the internet", but I think I csan still clear up some misconceptions.
[edit] In general, I would rely on platform-specifics such as x86 atomic reads and lack of OOOX only in isolated, local optimizations that are guarded by an #ifdef checking the target platform, ideally accompanied by a portable solution in the #else path.
Things to look out for
atomicity of read / write operations
reordering due to compiler optimizations (this includes a different order seen by another thread due to simple register caching)
out-of-order execution in the CPU
Possible misconceptions
1. As far as I understand it is, because function is protected and no reordering could be done by compiler inside.
[edit] To clarify: the _ReadWriteBarrier provides protection against instruction reordering, however, you have to look beyond the scope of the function. _ReadWriteBarrier has been fixed in VS 2010 to do that, earlier versions may be broken (depending on the optimizations they actually do).
Optimization isn't limited to functions. There are multiple mechanisms (automatic inlining, link time code generation) that span functions and even compilation units (and can provide much more significant optimizations than small-scoped register caching).
2. Visual C++ [...] makes volatile reads and writes atomic loads and stores,
Where did you find that? MSDN says that beyond the standard, will put memory barriers around reads and writes, no guarantee for atomic reads.
[edit] Note that C#, Java, Delphi etc. have different memory mdoels and may make different guarantees.
3. plain loads and stores are supposed to be atomic on x86 anyway, right?
No, they are not. Unaligned reads are not atomic. They happen to be atomic if they are well-aligned - a fact I'd not rely on unless it's isolated and easily exchanged. Otherwise your "simplificaiton fo x86" becomes a lockdown to that target.
[edit] Unaligned reads happen:
char * c = new char[sizeof(int)+1];
load(*(int *)c); // allowed by standard to be unaligned
load(*(int *)(c+1)); // unaligned with most allocators
#pragma pack(push,1)
struct
{
char c;
int i;
} foo;
load(foo.i); // caller said so
#pragma pack(pop)
This is of course all academic if you remember the parameter must be aligned, and you control all code. I wouldn't write such code anymore, because I've been bitten to often by laziness of the past.
4. Plain load has acquire semantics on x86, plain store has release semantics
No. x86 processors do not use out-of-order execution (or rather, no visible OOOX - I think), but this doesn't stop the optimizer from reordering instructions.
5. _ReadBarrier / _WriteBarrier / _ReadWriteBarrier do all the magic
They don't - they just prevent reordering by the optimizer. MSDN finally made it a big bad warning for VS2010, but the information apparently applies to previous versions as well.
Now, to your question.
I assume the purpose of the snippet is to pass any variable N, and load it (atomically?) The straightforward choice would be an interlocked read or (on Visual C++ 2005 and later) a volatile read.
Otherwise you'd need a barrier for both compiler and CPU before the read, in VC++ parlor this would be:
int load(int& var)
{
// force Optimizer to complete all memory writes:
// (Note that this had issues before VC++ 2010)
_WriteBarrier();
// force CPU to settle all pending read/writes, and not to start new ones:
MemoryBarrier();
// now, read.
int value = var;
return value;
}
Noe that _WriteBarrier has a second warning in MSDN:
*In past versions of the Visual C++ compiler, the _ReadWriteBarrier and _WriteBarrier functions were enforced only locally and did not affect functions up the call tree. These functions are now enforced all the way up the call tree.*
I hope that is correct. stackoverflowers, please correct me if I'm wrong.

Why is volatile not considered useful in multithreaded C or C++ programming?

As demonstrated in this answer I recently posted, I seem to be confused about the utility (or lack thereof) of volatile in multi-threaded programming contexts.
My understanding is this: any time a variable may be changed outside the flow of control of a piece of code accessing it, that variable should be declared to be volatile. Signal handlers, I/O registers, and variables modified by another thread all constitute such situations.
So, if you have a global int foo, and foo is read by one thread and set atomically by another thread (probably using an appropriate machine instruction), the reading thread sees this situation in the same way it sees a variable tweaked by a signal handler or modified by an external hardware condition and thus foo should be declared volatile (or, for multithreaded situations, accessed with memory-fenced load, which is probably a better a solution).
How and where am I wrong?

The problem with volatile in a multithreaded context is that it doesn't provide all the guarantees we need. It does have a few properties we need, but not all of them, so we can't rely on volatile alone.
However, the primitives we'd have to use for the remaining properties also provide the ones that volatile does, so it is effectively unnecessary.
For thread-safe accesses to shared data, we need a guarantee that:
the read/write actually happens (that the compiler won't just store the value in a register instead and defer updating main memory until much later)
that no reordering takes place. Assume that we use a volatile variable as a flag to indicate whether or not some data is ready to be read. In our code, we simply set the flag after preparing the data, so all looks fine. But what if the instructions are reordered so the flag is set first?
volatile does guarantee the first point. It also guarantees that no reordering occurs between different volatile reads/writes. All volatile memory accesses will occur in the order in which they're specified. That is all we need for what volatile is intended for: manipulating I/O registers or memory-mapped hardware, but it doesn't help us in multithreaded code where the volatile object is often only used to synchronize access to non-volatile data. Those accesses can still be reordered relative to the volatile ones.
The solution to preventing reordering is to use a memory barrier, which indicates both to the compiler and the CPU that no memory access may be reordered across this point. Placing such barriers around our volatile variable access ensures that even non-volatile accesses won't be reordered across the volatile one, allowing us to write thread-safe code.
However, memory barriers also ensure that all pending reads/writes are executed when the barrier is reached, so it effectively gives us everything we need by itself, making volatile unnecessary. We can just remove the volatile qualifier entirely.
Since C++11, atomic variables (std::atomic<T>) give us all of the relevant guarantees.

You might also consider this from the Linux Kernel Documentation.
C programmers have often taken volatile to mean that the variable
could be changed outside of the current thread of execution; as a
result, they are sometimes tempted to use it in kernel code when
shared data structures are being used. In other words, they have been
known to treat volatile types as a sort of easy atomic variable, which
they are not. The use of volatile in kernel code is almost never
correct; this document describes why.
The key point to understand with regard to volatile is that its
purpose is to suppress optimization, which is almost never what one
really wants to do. In the kernel, one must protect shared data
structures against unwanted concurrent access, which is very much a
different task. The process of protecting against unwanted
concurrency will also avoid almost all optimization-related problems
in a more efficient way.
Like volatile, the kernel primitives which make concurrent access to
data safe (spinlocks, mutexes, memory barriers, etc.) are designed to
prevent unwanted optimization. If they are being used properly, there
will be no need to use volatile as well. If volatile is still
necessary, there is almost certainly a bug in the code somewhere. In
properly-written kernel code, volatile can only serve to slow things
down.
Consider a typical block of kernel code:
spin_lock(&the_lock);
do_something_on(&shared_data);
do_something_else_with(&shared_data);
spin_unlock(&the_lock);
If all the code follows the locking rules, the value of shared_data
cannot change unexpectedly while the_lock is held. Any other code
which might want to play with that data will be waiting on the lock.
The spinlock primitives act as memory barriers - they are explicitly
written to do so - meaning that data accesses will not be optimized
across them. So the compiler might think it knows what will be in
shared_data, but the spin_lock() call, since it acts as a memory
barrier, will force it to forget anything it knows. There will be no
optimization problems with accesses to that data.
If shared_data were declared volatile, the locking would still be
necessary. But the compiler would also be prevented from optimizing
access to shared_data within the critical section, when we know that
nobody else can be working with it. While the lock is held,
shared_data is not volatile. When dealing with shared data, proper
locking makes volatile unnecessary - and potentially harmful.
The volatile storage class was originally meant for memory-mapped I/O
registers. Within the kernel, register accesses, too, should be
protected by locks, but one also does not want the compiler
"optimizing" register accesses within a critical section. But, within
the kernel, I/O memory accesses are always done through accessor
functions; accessing I/O memory directly through pointers is frowned
upon and does not work on all architectures. Those accessors are
written to prevent unwanted optimization, so, once again, volatile is
unnecessary.
Another situation where one might be tempted to use volatile is when
the processor is busy-waiting on the value of a variable. The right
way to perform a busy wait is:
while (my_variable != what_i_want)
cpu_relax();
The cpu_relax() call can lower CPU power consumption or yield to a
hyperthreaded twin processor; it also happens to serve as a memory
barrier, so, once again, volatile is unnecessary. Of course,
busy-waiting is generally an anti-social act to begin with.
There are still a few rare situations where volatile makes sense in
the kernel:
The above-mentioned accessor functions might use volatile on
architectures where direct I/O memory access does work. Essentially,
each accessor call becomes a little critical section on its own and
ensures that the access happens as expected by the programmer.
Inline assembly code which changes memory, but which has no other
visible side effects, risks being deleted by GCC. Adding the volatile
keyword to asm statements will prevent this removal.
The jiffies variable is special in that it can have a different value
every time it is referenced, but it can be read without any special
locking. So jiffies can be volatile, but the addition of other
variables of this type is strongly frowned upon. Jiffies is considered
to be a "stupid legacy" issue (Linus's words) in this regard; fixing it
would be more trouble than it is worth.
Pointers to data structures in coherent memory which might be modified
by I/O devices can, sometimes, legitimately be volatile. A ring buffer
used by a network adapter, where that adapter changes pointers to
indicate which descriptors have been processed, is an example of this
type of situation.
For most code, none of the above justifications for volatile apply.
As a result, the use of volatile is likely to be seen as a bug and
will bring additional scrutiny to the code. Developers who are
tempted to use volatile should take a step back and think about what
they are truly trying to accomplish.

I don't think you're wrong -- volatile is necessary to guarantee that thread A will see the value change, if the value is changed by something other than thread A. As I understand it, volatile is basically a way to tell the compiler "don't cache this variable in a register, instead be sure to always read/write it from RAM memory on every access".
The confusion is because volatile isn't sufficient for implementing a number of things. In particular, modern systems use multiple levels of caching, modern multi-core CPUs do some fancy optimizations at run-time, and modern compilers do some fancy optimizations at compile time, and these all can result in various side effects showing up in a different order from the order you would expect if you just looked at the source code.
So volatile is fine, as long as you keep in mind that the 'observed' changes in the volatile variable may not occur at the exact time you think they will. Specifically, don't try to use volatile variables as a way to synchronize or order operations across threads, because it won't work reliably.
Personally, my main (only?) use for the volatile flag is as a "pleaseGoAwayNow" boolean. If I have a worker thread that loops continuously, I'll have it check the volatile boolean on each iteration of the loop, and exit if the boolean is ever true. The main thread can then safely clean up the worker thread by setting the boolean to true, and then calling pthread_join() to wait until the worker thread is gone.

volatile is useful (albeit insufficient) for implementing the basic construct of a spinlock mutex, but once you have that (or something superior), you don't need another volatile.
The typical way of multithreaded programming is not to protect every shared variable at the machine level, but rather to introduce guard variables which guide program flow. Instead of volatile bool my_shared_flag; you should have
pthread_mutex_t flag_guard_mutex; // contains something volatile
bool my_shared_flag;
Not only does this encapsulate the "hard part," it's fundamentally necessary: C does not include atomic operations necessary to implement a mutex; it only has volatile to make extra guarantees about ordinary operations.
Now you have something like this:
pthread_mutex_lock( &flag_guard_mutex );
my_local_state = my_shared_flag; // critical section
pthread_mutex_unlock( &flag_guard_mutex );
pthread_mutex_lock( &flag_guard_mutex ); // may alter my_shared_flag
my_shared_flag = ! my_shared_flag; // critical section
pthread_mutex_unlock( &flag_guard_mutex );
my_shared_flag does not need to be volatile, despite being uncacheable, because
Another thread has access to it.
Meaning a reference to it must have been taken sometime (with the & operator).
(Or a reference was taken to a containing structure)
pthread_mutex_lock is a library function.
Meaning the compiler can't tell if pthread_mutex_lock somehow acquires that reference.
Meaning the compiler must assume that pthread_mutex_lock modifes the shared flag!
So the variable must be reloaded from memory. volatile, while meaningful in this context, is extraneous.

Your understanding really is wrong.
The property, that the volatile variables have, is "reads from and writes to this variable are part of perceivable behaviour of the program". That means this program works (given appropriate hardware):
int volatile* reg=IO_MAPPED_REGISTER_ADDRESS;
*reg=1; // turn the fuel on
*reg=2; // ignition
*reg=3; // release
int x=*reg; // fire missiles
The problem is, this is not the property we want from thread-safe anything.
For example, a thread-safe counter would be just (linux-kernel-like code, don't know the c++0x equivalent):
atomic_t counter;
...
atomic_inc(&counter);
This is atomic, without a memory barrier. You should add them if necessary. Adding volatile would probably not help, because it wouldn't relate the access to the nearby code (eg. to appending of an element to the list the counter is counting). Certainly, you don't need to see the counter incremented outside your program, and optimisations are still desirable, eg.
atomic_inc(&counter);
atomic_inc(&counter);
can still be optimised to
atomically {
counter+=2;
}
if the optimizer is smart enough (it doesn't change the semantics of the code).

For your data to be consistent in a concurrent environment you need two conditions to apply:
1) Atomicity i.e if I read or write some data to memory then that data gets read/written in one pass and cannot be interrupted or contended due to e.g a context switch
2) Consistency i.e the order of read/write ops must be seen to be the same between multiple concurrent environments - be that threads, machines etc
volatile fits neither of the above - or more particularly, the c or c++ standard as to how volatile should behave includes neither of the above.
It's even worse in practice as some compilers ( such as the intel Itanium compiler ) do attempt to implement some element of concurrent access safe behaviour ( i.e by ensuring memory fences ) however there is no consistency across compiler implementations and moreover the standard does not require this of the implementation in the first place.
Marking a variable as volatile will just mean that you are forcing the value to be flushed to and from memory each time which in many cases just slows down your code as you've basically blown your cache performance.
c# and java AFAIK do redress this by making volatile adhere to 1) and 2) however the same cannot be said for c/c++ compilers so basically do with it as you see fit.
For some more in depth ( though not unbiased ) discussion on the subject read this

The comp.programming.threads FAQ has a classic explanation by Dave Butenhof:
Q56: Why don't I need to declare shared variables VOLATILE?
I'm concerned, however, about cases where both the compiler and the
threads library fulfill their respective specifications. A conforming
C compiler can globally allocate some shared (nonvolatile) variable to
a register that gets saved and restored as the CPU gets passed from
thread to thread. Each thread will have it's own private value for
this shared variable, which is not what we want from a shared
variable.
In some sense this is true, if the compiler knows enough about the
respective scopes of the variable and the pthread_cond_wait (or
pthread_mutex_lock) functions. In practice, most compilers will not try
to keep register copies of global data across a call to an external
function, because it's too hard to know whether the routine might
somehow have access to the address of the data.
So yes, it's true that a compiler that conforms strictly (but very
aggressively) to ANSI C might not work with multiple threads without
volatile. But someone had better fix it. Because any SYSTEM (that is,
pragmatically, a combination of kernel, libraries, and C compiler) that
does not provide the POSIX memory coherency guarantees does not CONFORM
to the POSIX standard. Period. The system CANNOT require you to use
volatile on shared variables for correct behavior, because POSIX
requires only that the POSIX synchronization functions are necessary.
So if your program breaks because you didn't use volatile, that's a BUG.
It may not be a bug in C, or a bug in the threads library, or a bug in
the kernel. But it's a SYSTEM bug, and one or more of those components
will have to work to fix it.
You don't want to use volatile, because, on any system where it makes
any difference, it will be vastly more expensive than a proper
nonvolatile variable. (ANSI C requires "sequence points" for volatile
variables at each expression, whereas POSIX requires them only at
synchronization operations -- a compute-intensive threaded application
will see substantially more memory activity using volatile, and, after
all, it's the memory activity that really slows you down.)
/---[ Dave Butenhof ]-----------------------[ butenhof#zko.dec.com ]---\
| Digital Equipment Corporation 110 Spit Brook Rd ZKO2-3/Q18 |
| 603.881.2218, FAX 603.881.0120 Nashua NH 03062-2698 |
-----------------[ Better Living Through Concurrency ]----------------/
Mr Butenhof covers much of the same ground in this usenet post:
The use of "volatile" is not sufficient to ensure proper memory
visibility or synchronization between threads. The use of a mutex is
sufficient, and, except by resorting to various non-portable machine
code alternatives, (or more subtle implications of the POSIX memory
rules that are much more difficult to apply generally, as explained in
my previous post), a mutex is NECESSARY.
Therefore, as Bryan explained, the use of volatile accomplishes
nothing but to prevent the compiler from making useful and desirable
optimizations, providing no help whatsoever in making code "thread
safe". You're welcome, of course, to declare anything you want as
"volatile" -- it's a legal ANSI C storage attribute, after all. Just
don't expect it to solve any thread synchronization problems for you.
All that's equally applicable to C++.

This is all that "volatile" is doing:
"Hey compiler, this variable could change AT ANY MOMENT (on any clock tick) even if there are NO LOCAL INSTRUCTIONS acting on it. Do NOT cache this value in a register."
That is IT. It tells the compiler that your value is, well, volatile- this value may be altered at any moment by external logic (another thread, another process, the Kernel, etc.). It exists more or less solely to suppress compiler optimizations that will silently cache a value in a register that it is inherently unsafe to EVER cache.
You may encounter articles like "Dr. Dobbs" that pitch volatile as some panacea for multi-threaded programming. His approach isn't totally devoid of merit, but it has the fundamental flaw of making an object's users responsible for its thread-safety, which tends to have the same issues as other violations of encapsulation.

According to my old C standard, “What constitutes an access to an object that has volatile- qualified type is implementation-defined”. So C compiler writers could have choosen to have "volatile" mean "thread safe access in a multi-process environment". But they didn't.
Instead, the operations required to make a critical section thread safe in a multi-core multi-process shared memory environment were added as new implementation-defined features. And, freed from the requirement that "volatile" would provide atomic access and access ordering in a multi-process environment, the compiler writers prioritised code-reduction over historical implemention-dependant "volatile" semantics.
This means that things like "volatile" semaphores around critical code sections, which do not work on new hardware with new compilers, might once have worked with old compilers on old hardware, and old examples are sometimes not wrong, just old.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js