How 'undefined' a race condition can be? - c++

Let's say I define a following C++ object:
class AClass
{
public:
AClass() : foo(0) {}
uint32_t getFoo() { return foo; }
void changeFoo() { foo = 5; }
private:
uint32_t foo;
} aObject;
The object is shared by two threads, T1 and T2. T1 is constantly calling getFoo() in a loop to obtain a number (which will be always 0 if changeFoo() was not called before). At some point, T2 calls changeFoo() to change it (without any thread synchronization).
Is there any practical chance that the values ever obtained by T1 will be different than 0 or 5 with modern computer architectures and compilers? All the assembler code I investigated so far was using 32-bit memory reads and writes, which seems to save the integrity of the operation.
What about other primitive types?
Practical means that you can give an example of an existing architecture or a standard-compliant compiler where this (or a similar situation with a different code) is theoretically possible. I leave the word modern a bit subjective.
Edit: I can see many people noticing that I should not expect 5 to be read ever. That is perfectly fine to me and I did not say I do (though thanks for pointing this aspect out). My question was more about what kind of data integrity violation can happen with the above code.

In practice, you will not see anything else than 0 or 5 as far as I know (maybe some weird 16 bits architecture with 32 bits int where this is not the case).
However whether you actually see 5 at all is not guaranteed.
Suppose I am the compiler.
I see:
while (aObject.getFoo() == 0) {
printf("Sleeping");
sleep(1);
}
I know that:
printf cannot change aObject
sleep cannot change aObject
getFoo does not change aObject (thanks inline definition)
And therefore I can safely transform the code:
while (true) {
printf("Sleeping");
sleep(1);
}
Because there is no-one else accessing aObject during this loop, according to the C++ Standard.
That is what undefined behavior means: blown up expectations.

In practice, all mainstream 32-bit architectures perform 32-bit reads and writes atomically. You'll never see anything other than 0 or 5.

Not sure what you're looking for. On most modern architectures, there is a very distinct possibility that getFoo() always returns 0, even after changeFoo has been called. With just about any decent compiler, it's almost guaranteed that getFoo(), will always return the same value, regardless of any calls to changeFoo, if it is called in a tight loop.
Of course, in any real program, there will be other reads and writes, which will be totally unsynchronized with regards to the changes in foo.
And finally, there are 16 bit processors, and there may also be a possibility with some compilers that the uint32_t isn't aligned, so that the accesses won't be atomic. (Of course, you're only changing bits in one of the bytes, so this might not be an issue.)

In practice (for those who did not read the question), any potential problem boils down to whether or not a store operation for an unsigned int is an atomic operation which, on most (if not all) machines you will likely write code for, it will be.
Note that this is not stated by the standard; it is specific to the architecture you are targeting. I cannot envision a scenario in which a calling thread will red anything other than 0 or 5.
As to the title... I am unaware of varying degrees of "undefined behavior". UB is UB, it is a binary state.

Is there any practical chance that the values ever obtained by T1 will be different than 0 or 5 with modern computer architectures and compilers? What about other primitive types?
Sure - there is no guarantee that the entire data will be written and read in an atomic manner. In practice, you may end up with a read which occurred during a partial write. What may be interrupted, and when that happens depends on several variables. So in practice, the results could easily vary as size and alignment of types vary. Naturally, that variance may also be introduced as your program moves from platform to platform and as ABIs change. Furthermore, observable results may vary as optimizations are added and other types/abstractions are introduced. A compiler is free to optimize away much of your program; perhaps completely, depending of the scope of the instance (yet another variable which is not considered in the OP).
Beyond optimizers, compilers, and hardware specific pipelines: The kernel can even affect the manner in which this memory region is handled. Does your program Guarantee where the memory of each object resides? Probably not. Your object's memory may exist on separate virtual memory pages -- what steps does your program take to ensure the memory is read and written in a consistent manner for all platforms/kernels? (none, apparently)
In short: If you cannot play by the rules defined by the abstract machine, you should not use the interface of said abstract machine (e.g. you should just understand and use assembly if the specification of C++'s abstract machine is truly inadequate for your needs -- highly improbable).
All the assembler code I investigated so far was using 32-bit memory reads and writes, which seems to save the integrity of the operation.
That's a very shallow definition of "integrity". All you have is (pseudo-)sequential consistency. As well, the compiler needs only to behave as if in such a scenario -- which is far from strict consistency. The shallow expectation means that even if the compiler actually made no breaking optimization and performed reads and writes in accordance with some ideal or intention, that the result would be practically useless -- your program would observe changes typically 'long' after its occurrence.
The subject remains irrelevant, given what specifically you can Guarantee.

Undefined behavior means that the compiler can do what ever he wants. He could basically change your program to do what ever he likes, e.g. order a pizza.
See, #Matthieu M. answer for a less sarcastic version than this one. I won't delete this as I think the comments are important for the discussion.

Undefined behavior is guaranteed to be as undefined as the word undefined.
Technically, the observable behavior is pointless because it is simply undefined behavior, the compiler is not needed to show you any particular behavior. It may work as you think it should or not or may burn your computer, anything and everything is possible.

Related

Compiler optimizations and temporary assignments in C and C++

Please see the following code valid in C and C++:
extern int output;
extern int input;
extern int error_flag;
void func(void)
{
if (0 != error_flag)
{
output = -1;
}
else
{
output = input;
}
}
Is the compiler allowed to compile the above code in the same way as if it looked like below?
extern int output;
extern int input;
extern int error_flag;
void func(void)
{
output = -1;
if (0 == error_flag)
{
output = input;
}
}
In other words, is the compiler allowed to generate (from the first snippet) code that always makes a temporary assignment of -1 to output and then assign input value to output depending on error_flag status?
Would the compiler be allowed to do it if output would be declared as volatile?
Would the compiler be allowed to do it if output would be declared as atomic_int (stdatomic.h)?
Update after David Schwartz's comment:
If the compiler is free to add additional writes to a variable, it seems it is not possible to tell from the C code whether a data race exists or not. How to determine this?
Yes, the speculative assignment is possible. Modification of a non-volatile variable is not part of the observable behaviour of the program and thus a spurious write is allowed. (See below for the definition of "observable behaviour", which does not actually include all behaviour which you might observe.)
No. If output is volatile, speculative or spurious mutations are not permitted because the mutation is part of observable behaviour. (Writing to -- or reading from -- a hardware register may have consequences other than just storing a value. This is one of the primary use cases of volatile.)
(Edited) No, the speculative assignment is not possible with atomic output. Loads and stores of atomic variables are synchronized operations, so it should not be possible to load a value of such a variable which was not explicitly stored into the variable.
Observable behaviour
Although a program can do lots of obviously observable things (for example, abruptly terminating because of a segfault), the C and C++ standards only guarantee a limited set of results. Observable behaviour is defined in the C11 draft at §5.1.2.3p6 and in the current C++14 draft at §1.9p8 [intro.execution] with very similar wording:
The least requirements on a conforming implementation are:
— Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
— At program termination, all data written into files shall be identical to one of the possible results that execution of the program according to the abstract semantics would have produced.
— The input and output dynamics of interactive devices shall take place in such a fashion that prompting output is actually delivered before a program waits for input. What constitutes an interactive device is implementation-defined.
These collectively are referred to as the observable behavior of the program.
The above is taken from the C++ standard; the C standard differs in that in the second point it does not allow multiple possible results, and in the third point it explicitly references a relevant section of the standard library requirements. But details aside, the definitions are co-ordinated; for the purpose of this question, the relevant point is that only access to volatile variables is observable (up to the point that the value of a non-volatile variable is sent to an output device or file).
Data Races
This paragraph needs also to be read in the overall context of the C and C++ standards, which free the implementation from all requirements if the program engenders undefined behaviour. That's why the segfault is not considered in the definition of observable behaviour above: the segfault is a possible undefined behaviour but not a possible behaviour in a conformant program. So in the universe of only conformant programs and conformant implementations, there are no segfaults.
That's important because a program with a data race is not conformant. A data race has undefined behaviour, even if it seems innocuous. And since it is the responsibility of the programmer to avoid undefined behaviour, the implementation may optimize without regard to data races.
The exposition of the memory model in the C and C++ standards is dense and technical, and probably not suitable as an introduction to the concepts. (Browsing around the material on Hans Boehm's site will probably prove less difficult.) Extracting quotes from the standard is risky, because the details are important. But here is a small leap into the morass, from the current C++14 standard, §1.10 [intro.multithread]:
Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.
…
Two actions are potentially concurrent if
— they are performed by different threads, or
— they are unsequenced, and at least one is performed by a signal handler.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
The take-away here is that a read and a write of the same variable need to be synchronized; otherwise it is a data race and the result is undefined behaviour. Some programmers might object to the strictness of this prohibition, arguing that some data races are "benign". This is the topic of Hans Boehm's 2011 HotPar paper "How to miscompile programs with "benign" data races" (pdf) (author's summary: "There are no benign data races"), and he explains it all much better than I could.
Synchronization here includes the use of atomic types, so it is not a data race to concurrently read and modify an atomic variable. (The result of the read is unpredictable, but it must be either the value before the modification or the value afterwards.) This prevents the compiler from performing "piecemeal" modification of an atomic variable without some explicit synchronization.
After some thought and more research, my conclusion is that the compiler cannot perform speculative writes to atomic variables either. Consequently, I modified the answer to question 3, which I had originally answered "no".
Other useful references:
Bartosz Milewski: Dealing with Benign Data Races the C++ Way
Milewski deals with the precise issue of speculative writes to atomic variables, and concludes:
Can’t the compiler still do the same dirty trick, and momentarily store 42 in the owner variable? No, it can’t! Since the variable is declared atomic the compiler can no longer assume that the write can’t be observed by other threads.
Herb Sutter on Thread Safety and Synchronization
As usual, an accessible and well-written explanation.
Yes, the compiler is allowed to do that kind of optimization. In general, you can assume that the compiler (and the CPU too) can reorder your code assuming it is running in a single thread. If you have more than one thread, you need to synchronize. If you don't synchronize and your code writes to a memory location that is written to or read by another thread, your code contains a data race, in C++ this is undefined behavior.
volatile doesn't change the data race problem. However IIRC, the compiler is not allowed to reorder reads and writes to a volatile variable.
When using atomic_int, the compiler can still perform certain optimizations. I don't think that the compiler can invent writes though (that could break a multithreaded program). However, it can still reorder operations, so be careful.

Efficiency penalty of initializing a struct/class within a loop

I've done my best to find an answer to this with no luck. Also, I've tested it and don't see any difference whatsoever in an optimized release build (there is a difference in debug)... still, I can't imagine why there is no difference, or how the optimizer is able to remove the penalty, and maybe someone knows what is happening internally.
If I create new instances of a simple class/struct within a loop, is there any penalty in efficiency for creating the class/struct on every loop iteration?
i.e.
struct mystruct
{
inline mystruct(const double &initial) : _myvalue(initial) {}
double myvalue;
}
why does...
for(int i=0; i<big_int; ++i)
{
mystruct a = mystruct(1.1)
}
take the same amount of real time as
for(int i=0; i<big_int; ++i)
{
double s = 1.1
}
?? Shouldn't there be some time required for the constructor/initialization?
This is easy-peasy work for a modern optimizer to handle.
As a programmer you might look at that constructor and struct and think it has to cost something. "The constructor code involves branching, passing arguments through registers/stack, popping from the stack, etc. The struct is a user-defined type, it must add more data somewhere. There's aliasing/indirection overhead for the const reference, etc."
Except the optimizer then has a go at your code, and it notices that the struct has no virtual functions, it has no objects that require a non-trivial constructor. The whole thing fits into a general-purpose register. And then it notices that your constructor is doing little more than assigning one variable to another. And it'll probably even notice that you're just calling it with a literal constant, which translates to a single move/store instruction to a register which doesn't even require any additional memory beyond the instruction.
It's all very magical, and compilers are sophisticated beasts, but they usually do this in multiple passes, and from your original code to intermediate representations, and from intermediate representations to machine code. To really appreciate and understand what they do, it's worth having a peek at the disassembly from time to time.
It's worth noting that C++ has been around for decades. As a successor to C, it originally was pushed mostly as an object-oriented language with hot concepts like encapsulation and information hiding. To promote a language where people start replacing public data members and manual initialization/destruction and things like that for simple accessor functions, constructors, destructors, it would have been very difficult to popularize the language if there was a measurable overhead in even a simple function call. So as magical as this all sounds, C++ optimizers have been doing this now for decades, squashing all that overhead you add to make things easier to maintain down to the same assembly as something which wouldn't be so easy to maintain.
So it's generally worth thinking of things like function calls and small structures as being basically free, since if it's worth inlining and squashing away all the overhead to zilch, optimizers will generally do it. Exceptions arise with indirect function calls: virtual methods, calls through function pointers, etc. But the code you posted is easy stuff for a modern optimizer to squash down.
C++ philosophy is that you should not "pay" (in CPU cycles or in memory bytes) for anything that you do not use. The struct in your example is nothing more than a double with a constructor tied to it. Moreover, the constructor can be inlined, bringing the overhead all the way down to zero.
If your struct had other parts to initialize, such as other fields or a table of virtual functions, there would be some overhead. The way your example is set up, however, the compiler can optimize out the constructor, producing an assembly output that boils down to a single assignment of a double.
Neither of your loops do anything. Dead code may be removed. Furthermore, there is no representational difference between a struct containing a single double and a primitive double. The compier should be able to easily "see through" an inline constructor. C++ relies on optimisations of these things to allow its abstractions to compete with hand-written versions.
There is no reason for the performance to be different, and if it were, I would consider it a bug (up to debug builds, where debug information could change the performance cost).
These quotes from the C++ Standard may help to understand what optimization is permitted:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
and also:
The least requirements on a conforming implementation are:
Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
At program termination, all data written into files shall be identical to one of the possible results that execution of the program according to the abstract semantics would have produced.
The input and output dynamics of interactive devices shall take place in such a fashion that prompting output is actually delivered before a program waits for input. What constitutes an interactive device is implementation-defined.
These collectively are referred to as the observable behavior of the program.
To summarize: the compiler can generate whatever executable it likes so long as that executable performs the same I/O and access to volatile variables as the unoptimized version would. In particular, there are no requirements about timing or memory allocation.
In your code sample, the entire thing could be optimized out as it produces no observable behaviour. However, real-world compilers sometimes decide to leave in things that could be optimized out, if they think the programmer really wanted those operations to happen for some reason.
#Ikes answer is exactly what I was getting at. However, If you are curious about this question, I very much recommend reading answers of #dasblinkenlight, #Mankarse, and #Matt McNabb and the discussions below them, which get at the details of the situation. Thanks all.

May accesses to volatiles be reordered?

Consider the following sequence of writes to volatile memory, which I've taken from David Chisnall's article at InformIT, "Understanding C11 and C++11 Atomics":
volatile int a = 1;
volatile int b = 2;
a = 3;
My understanding from C++98 was that these operations could not be reordered, per C++98 1.9:
conforming
implementations are required to emulate (only) the observable behavior of the abstract machine as
explained below
...
The observable behavior of the abstract machine is its sequence of reads and writes to volatile data and
calls to library I/O functions
Chisnall says that the constraint on order preservation applies only to individual variables, writing that a conforming implementation could generate code that does this:
a = 1;
a = 3;
b = 2;
Or this:
b = 2;
a = 1;
a = 3;
C++11 repeats the C++98 wording that
conforming
implementations are required to emulate (only) the observable behavior of the abstract machine as explained
below.
but says this about volatiles (1.9/8):
Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
1.9/12 says that accessing a volatile glvalue (which includes the variables a, b, and c above) is a side effect, and 1.9/14 says that the side effects in one full expression (e.g., a statement) must precede the side effects of a later full expression in the same thread. This leads me to conclude that the two reorderings Chisnall shows are invalid, because they do not correspond to the ordering dictated by the abstract machine.
Am I overlooking something, or is Chisnall mistaken?
(Note that this is not a threading question. The question is whether a compiler is permitted to reorder accesses to different volatile variables in a single thread.)
IMO Chisnalls interpretation (as presented by you) is clearly wrong. The simpler case is C++98. The sequence of reads and writes to volatile data needs to be preserved and that applies to the ordered sequence of reads and writes of any volatile data, not to a single variable.
This becomes obvious, if you consider the original motivation for volatile: memory-mapped I/O. In mmio you typically have several related registers at different memory location and the protocol of an I/O device requires a specific sequence of reads and writes to its set of registers - order between registers is important.
The C++11 wording avoids talking about an absolute sequence of reads and writes, because in multi-threaded environments there is not one single well-defined sequence of such events across threads - and that is not a problem, if these accesses go to independent memory locations. But I believe the intent is that for any sequence of volatile data accesses with a well-defined order the rules remain the same as for C++98 - the order must be preserved, no matter how many different locations are accessed in that sequence.
It is an entirely separate issue what that entails for an implementation. How (and even if) a volatile data access is observable from outside the program and how the access order of the program maps to externally observable events is unspecified. An implementation should probably give you a reasonable interpretation and reasonable guarantees, but what is reasonable depends on the context.
The C++11 standard leaves room for data races between unsynchronized volatile accesses, so there is nothing that requires surrounding these by full memory fences or similar constructs. If there are parts of memory that are truly used as external interface - for memory-mapped I/O or DMA - then it may be reasonable for the implementation to give you guarantees for how volatile accesses to these parts are exposed to consuming devices.
One guarantee can probably be inferred from the standard (see [into.execution]): values of type volatile std::sigatomic_t must have values compatible with the order of writes to them even in a signal handler - at least in a single-threaded program.
You're right, he's wrong. Accesses to distinct volatile variables cannot be reordered by the compiler as long as they occur in separate full expressions i.e. are separated by what C++98 called a sequence point, or in C++11 terms one access is sequenced before the other.
Chisnall seems to be trying to explain why volatile is useless for writing thread-safe code, by showing a simple mutex implementation relying on volatile that would be broken by compiler reorderings. He's right that volatile is useless for thread-safety, but not for the reasons he gives. It's not because the compiler might reorder accesses to volatile objects, but because the CPU might reorder them. Atomic operations and memory barriers prevent the compiler and the CPU from reordering things across the barrier, as needed for thread-safety.
See the bottom right cell of Table 1 at Sutter's informative volatile vs volatile article.
For the moment, I'm going to assume your a=3s are just a mistake in copying and pasting, and you really meant them to be c=3.
The real question here is one of the difference between evaluation, and how things become visible to another processor. The standards describe order of evaluation. From that viewpoint, you're entirely correct -- given assignments to a, b and c in that order, the assignments must be evaluated in that order.
That may not correspond to the order in which those values become visible to other processors though. On a typical (current) CPU, that evaluation will only write values out to the cache. The hardware can reorder things from there though, so (for example) writes out to main memory happen in an entirely different order. Likewise, if another processor attempts to use the values, it may see them as changing in a different order.
Yes, this is entirely allowable -- the CPU is still evaluating the assignments in exactly the order prescribed by the standard, so the requirements are met. The standard simply doesn't place any requirements on what happens after evaluation, which is what happens here.
I should add: on some hardware it is sufficient though. For example, the x86 uses cache snooping, so if another processor tries to read a value that's been updated by one processor (but is still only in the cache) the processor that has the current value will put a hold on the read by the other processor until the current value can be written out so the other processor will see the current value.
That's not the case with all hardware though. While maintaining that strict model keeps things simple, it's also fairly expensive both in terms of extra hardware to ensure consistency and in simple speed when/if you have a lot of processors.
Edit: if we ignore threading for a moment, the question gets a little simpler -- but not much. According to C++11, §1.9/12:
When a call to a library I/O function returns or an access to a volatile object is evaluated the side effect is considered complete, even though some external actions implied by the call (such as the I/O itself) or by the volatile access may not have completed yet.
As such, the accesses to volatile objects must be initiated in order, but not necessarily completed in order. Unfortunately, it's often the completion that's externally visible. As such, we pretty much come back to the usual as-if rule: the compiler can rearrange things as much as it wants, as long it produces no externally visible change.
Looks like it can happen.
There is a discussion on this page:
http://gcc.gnu.org/ml/gcc/2003-11/msg01419.html
It depends on your compiler. For example, MSVC++ as of Visual Studio 2005 guarantees* volatiles will not be reordered (actually, what Microsoft did is give up and assume programmers will forever abuse volatile - MSVC++ now adds a memory barrier around certain usages of volatile). Other versions and other compilers may not have such guarantees.
Long story short: don't bet on it. Design your code properly, and don't misuse volatile. Use memory barriers instead or full-blown mutexes as necessary. C++11's atomic types will help.
C++98 doesn't say the instructions cannot be re-ordered.
The observable behavior of the abstract machine is its sequence of reads and writes to volatile data and calls to library I/O functions
This says it's the actual sequence of the reads and writes themselves, not the instructions that generate them. Any argument that says that the instructions must reflect the reads and writes in program order could equally argue that the reads and writes to the RAM itself must occur in program order, and clearly that's an absurd interpretation of the requirement.
Simply put, this doesn't mean anything. There is no "one right place" to observe the orders of reads and writes (The RAM bus? The CPU bus? Between the L1 and L2 caches? From another thread? From another core?), so this requirement is essentially meaningless.
Versions of C++ prior to any references to threads clearly don't specify the behavior of volatile variables as seen from another thread. And C++11 (wisely, IMO) didn't change this but instead introduced sensible atomic operations with well-defined inter-thread semantics.
As for memory-mapped hardware, that's always going to be platform-specific. The C++ standard doesn't even pretend to address how that might be done properly. For example, the platform might be such that only a subset of memory operations are legal in that context, say ones that bypass a write posting buffer that can reorder, and the C++ standard certainly doesn't compel the compiler to emit the right instructions for that particular hardware device -- how could it?
Update: I see some downvotes because people don't like this truth. Unfortunately, it is true.
If the C++ standard prohibits the compiler from reordering accesses to distinct volatiles, on the theory that the order of such accesses is part of the program's observable behavior, then it also requires the compiler to emit code that prohibits the CPU from doing so. The standard does not differentiate between what the compiler does and what the compiler's generated code makes the CPU do.
Since nobody believes the standard requires the compiler to emit instructions to keep the CPU from reordering accesses to volatile variables, and modern compilers don't do this, nobody should believe the C++ standard prohibits the compiler from reordering accesses to distinct volatiles.

Is writing to memory an observable behaviour?

I've looked at the standard but couldn't find any indication that simply writing to memory would be considered observable behaviour. If not, that would mean the compiled code need not actually write to that memory. If a compiler choose to optimize away such access anything involving mapper memory, or shared memory, may not work.
1.9-8 seems to defined a very limited observable behaviour but indicates an implementation may define more. Can one assume than any quality compiler would treat modifying memory as an observable behaviour? That is, it may not guarantee atomicity or ordering, but does guarantee that data will eventually be written.
So, have I overlooked something in the standard, or is the writing to memory merely something the compiler decides to do?
Statements from the current or C++0x standard are good. Please note I'm not talking about accessing memory through a function, I mean direct access, such as writing data to a pointer (perhaps retrieved via mmap or another library function).
This kind of thing is what volatile exists for. Else, writing to memory and never apparently reading from it is not observable behaviour. However, in the general case, it would be quite impossible for the optimizer to prove that you never read it back except in relatively trivial examples, so it's not usually an issue.
Can one assume than any quality compiler would treat modifying memory as an observable behaviour?
No. Volatile is meant for marking that. However, you cannot fully trust the compiler even after adding the volatile qualifier, at least as told by a 2008 paper: http://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf
EDIT:
From C standard (not C++) http://c0x.coding-guidelines.com/5.1.2.3.html
An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
My reading of C99 is that unless you specify volatile, how and when the variable is actually accessed is implementation defined. If you specify volatile qualifier then code must work according to the rules of an abstract machine.
Relevant parts in the standard are: 6.7.3 Type qualifiers (volatile description) and 5.1.2.3 Program execution (the abstract machine definition).
For some time now I know that many compilers actually have heuristics to detect cases when a variable should be reread again and when it is okay to use a cached copy. Volatile makes it clear to the compiler that every access to the variable should be actually an access to the memory. Without volatile it seems compiler is free to never reread the variable.
And BTW wrapping the access in a function doesn't change that since a function even without inline might be still inlined by the compiler within the current compilation unit.
From your question below:
Assume I use an array on the heap (unspecified where it is allocated),
and I use that array to perform a calculation (temp space). The
optimizer sees that it doesn't actually need any of that space as it
can use strictly registers. Does the compiler nonetheless write the
temp values to the memory?
Per MSalters below:
It's not guaranteed, and unlikely. Consider a a Static Single
Assignment optimizer. This figures out each possible write/read
dependency, and then assigns registers to optimize these dependencies.
As a side effect, any write that's not followed by a (possible) read
creates no dependencies at all, and is eliminated. In your example
("use strictly registers") the optimizer has satisfied all write/read
dependencies with registers, so it won't write to memory at all. All
reads produce the correct values, so it's a correct optimization.

What Rules does compiler have to follow when dealing with volatile memory locations?

I know when reading from a location of memory which is written to by several threads or processes the volatile keyword should be used for that location like some cases below but I want to know more about what restrictions does it really make for compiler and basically what rules does compiler have to follow when dealing with such case and is there any exceptional case where despite simultaneous access to a memory location the volatile keyword can be ignored by programmer.
volatile SomeType * ptr = someAddress;
void someFunc(volatile const SomeType & input){
//function body
}
What you know is false. Volatile is not used to synchronize memory access between threads, apply any kind of memory fences, or anything of the sort. Operations on volatile memory are not atomic, and they are not guaranteed to be in any particular order. volatile is one of the most misunderstood facilities in the entire language. "Volatile is almost useless for multi-threadded programming."
What volatile is used for is interfacing with memory-mapped hardware, signal handlers and the setjmp machine code instruction.
It can also be used in a similar way that const is used, and this is how Alexandrescu uses it in this article. But make no mistake. volatile doesn't make your code magically thread safe. Used in this specific way, it is simply a tool that can help the compiler tell you where you might have messed up. It is still up to you to fix your mistakes, and volatile plays no role in fixing those mistakes.
EDIT: I'll try to elaborate a little bit on what I just said.
Suppose you have a class that has a pointer to something that cannot change. You might naturally make the pointer const:
class MyGizmo
{
public:
const Foo* foo_;
};
What does const really do for you here? It doesn't do anything to the memory. It's not like the write-protect tab on an old floppy disc. The memory itself it still writable. You just can't write to it through the foo_ pointer. So const is really just a way to give the compiler another way to let you know when you might be messing up. If you were to write this code:
gizmo.foo_->bar_ = 42;
...the compiler won't allow it, because it's marked const. Obviously you can get around this by using const_cast to cast away the const-ness, but if you need to be convinced this is a bad idea then there is no help for you. :)
Alexandrescu's use of volatile is exactly the same. It doesn't do anything to make the memory somehow "thread safe" in any way whatsoever. What it does is it gives the compiler another way to let you know when you may have screwed up. You mark things that you have made truly "thread safe" (through the use of actual synchronization objects, like Mutexes or Semaphores) as being volatile. Then the compiler won't let you use them in a non-volatile context. It throws a compiler error you then have to think about and fix. You could again get around it by casting away the volatile-ness using const_cast, but this is just as Evil as casting away const-ness.
My advice to you is to completely abandon volatile as a tool in writing multithreadded applications (edit:) until you really know what you're doing and why. It has some benefit but not in the way that most people think, and if you use it incorrectly, you could write dangerously unsafe applications.
It's not as well defined as you probably want it to be. Most of the relevant standardese from C++98 is in section 1.9, "Program Execution":
The observable behavior of the abstract machine is its sequence of reads and writes to volatile data and calls to library I/O functions.
Accessing an object designated by a volatile lvalue (3.10), modifying an object, calling a library I/O function, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment. Evaluation of an expression might produce side effects. At certain specified points in the execution sequence called sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have taken place.
Once the execution of a function begins, no expressions from the calling function are evaluated until execution of the called function has completed.
When the processing of the abstract machine is interrupted by receipt of a signal, the values of objects with type other than volatile sig_atomic_t are unspecified, and the value of any object not of volatile sig_atomic_t that is modified by the handler becomes undefined.
An instance of each object with automatic storage duration (3.7.2) is associated with each entry into its block. Such an object exists and retains its last-stored value during the execution of the block and while the block is suspended (by a call of a function or receipt of a signal).
The least requirements on a conforming implementation are:
At sequence points, volatile objects are stable in the sense that previous evaluations are complete and subsequent evaluations have not yet occurred.
At program termination, all data written into files shall be identical to one of the possible results that execution of the program according to the abstract semantics would have produced.
The input and output dynamics of interactive devices shall take place in such a fashion that prompting messages actually appear prior to a program waiting for input. What constitutes an interactive device is implementation-defined.
So what that boils down to is:
The compiler cannot optimize away reads or writes to volatile objects. For simple cases like the one casablanca mentioned, that works the way you might think. However, in cases like
volatile int a;
int b;
b = a = 42;
people can and do argue about whether the compiler has to generate code as if the last line had read
a = 42; b = a;
or if it can, as it normally would (in the absence of volatile), generate
a = 42; b = 42;
(C++0x may have addressed this point, I haven't read the whole thing.)
The compiler may not reorder operations on two different volatile objects that occur in separate statements (every semicolon is a sequence point) but it is totally allowed to rearrange accesses to non-volatile objects relative to volatile ones. This is one of the many reasons why you should not try to write your own spinlocks, and is the primary reason why John Dibling is warning you not to treat volatile as a panacea for multithreaded programming.
Speaking of threads, you will have noticed the complete absence of any mention of threads in the standards text. That is because C++98 has no concept of threads. (C++0x does, and may well specify their interaction with volatile, but I wouldn't be assuming anyone implements those rules yet if I were you.) Therefore, there is no guarantee that accesses to volatile objects from one thread are visible to another thread. This is the other major reason volatile is not especially useful for multithreaded programming.
There is no guarantee that volatile objects are accessed in one piece, or that modifications to volatile objects avoid touching other things right next to them in memory. This is not explicit in what I quoted but is implied by the stuff about volatile sig_atomic_t -- the sig_atomic_t part would be unnecessary otherwise. This makes volatile substantially less useful for access to I/O devices than it was probably intended to be, and compilers marketed for embedded programming often offer stronger guarantees, but it's not something you can count on.
Lots of people try to make specific accesses to objects have volatile semantics, e.g. doing
T x;
*(volatile T *)&x = foo();
This is legit (because it says "object designated by a volatile lvalue" and not "object with a volatile type") but has to be done with great care, because remember what I said about the compiler being totally allowed to reorder non-volatile accesses relative to volatile ones? That goes even if it's the same object (as far as I know anyway).
If you are worried about reordering of accesses to more than one volatile value, you need to understand the sequence point rules, which are long and complicated and I'm not going to quote them here because this answer is already too long, but here's a good explanation which is only a little simplified. If you find yourself needing to worry about the differences in the sequence point rules between C and C++ you have already screwed up somewhere (for instance, as a rule of thumb, never overload &&).
A particular and very common optimization that is ruled out by volatile is to cache a value from memory into a register, and use the register for repeated access (because this is much faster than going back to memory every time).
Instead the compiler must fetch the value from memory every time (taking a hint from Zach, I should say that "every time" is bounded by sequence points).
Nor can a sequence of writes make use of a register and only write the final value back later on: every write must be pushed out to memory.
Why is this useful? On some architectures certain IO devices map their inputs or outputs to a memory location (i.e. a byte written to that location actually goes out on the serial line). If the compiler redirects some of those writes to a register that is only flushed occasionally then most of the bytes won't go onto the serial line. Not good. Using volatile prevents this situation.
Declaring a variable as volatile means the compiler can't make any assumptions about the value that it could have done otherwise, and hence prevents the compiler from applying various optimizations. Essentially it forces the compiler to re-read the value from memory on each access, even if the normal flow of code doesn't change the value. For example:
int *i = ...;
cout << *i; // line A
// ... (some code that doesn't use i)
cout << *i; // line B
In this case, the compiler would normally assume that since the value at i wasn't modified in between, it's okay to retain the value from line A (say in a register) and print the same value in B. However, if you mark i as volatile, you're telling the compiler that some external source could have possibly modified the value at i between line A and B, so the compiler must re-fetch the current value from memory.
The compiler is not allowed to optimize away reads of a volatile object in a loop, which otherwise it'd normally do (i.e. strlen()).
It's commonly used in embedded programming when reading a hardware registry at a fixed address, and that value may change unexpectedly. (In contrast with "normal" memory, that doesn't change unless written to by the program itself...)
That is it's main purpose.
It could also be used to make sure one thread see the change in a value written by another, but it in no way guarantees atomicity when reading/writing to said object.