Function not called in code gets called at runtime

Function not called in code gets called at runtime - c++

How can the following program be calling format_disk if it's never
called in code?
#include <cstdio>
static void format_disk()
{
std::puts("formatting hard disk drive!");
}
static void (*foo)() = nullptr;
void never_called()
{
foo = format_disk;
}
int main()
{
foo();
}
This differs from compiler to compiler. Compiling with Clang with
optimizations on, the function never_called executes at runtime.
$ clang++ -std=c++17 -O3 a.cpp && ./a.out
formatting hard disk drive!
Compiling with GCC, however, this code just crashes:
$ g++ -std=c++17 -O3 a.cpp && ./a.out
Segmentation fault (core dumped)
Compilers version:
$ clang --version
clang version 5.0.0 (tags/RELEASE_500/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ gcc --version
gcc (GCC) 7.2.1 20171128
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The program contains undefined behavior, as dereferencing a null pointer
(i.e. calling foo() in main without assigning a valid address to it
beforehand) is UB, therefore no requirements are imposed by the standard.
Executing format_disk at runtime is a perfect valid situation when
undefined behavior has been hit, it's as valid as just crashing (like
when compiled with GCC). Okay, but why is Clang doing that? If you
compile it with optimizations off, the program will no longer output
"formatting hard disk drive", and will just crash:
$ clang++ -std=c++17 -O0 a.cpp && ./a.out
Segmentation fault (core dumped)
The generated code for this version is as follows:
main: # #main
push rbp
mov rbp, rsp
call qword ptr [foo]
xor eax, eax
pop rbp
ret
It tries to make a call to a function to which foo points, and as foo
is initialized with nullptr (or if it didn't have any initialization,
this would still be the case), its value is zero. Here, undefined
behavior has been hit, so anything can happen at all and the program
is rendered useless. Normally, making a call to such an invalid address
results in segmentation fault errors, hence the message we get when
executing the program.
Now let's examine the same program but compiling it with optimizations on:
$ clang++ -std=c++17 -O3 a.cpp && ./a.out
formatting hard disk drive!
The generated code for this version is as follows:
never_called(): # #never_called()
ret
main: # #main
push rax
mov edi, .L.str
call puts
xor eax, eax
pop rcx
ret
.L.str:
.asciz "formatting hard disk drive!"
Interestingly, somehow optimizations modified the program so that
main calls std::puts directly. But why did Clang do that? And why is
never_called compiled to a single ret instruction?
Let's get back to the standard (N4660, specifically) for a moment. What
does it say about undefined behavior?
3.27 undefined behavior [defns.undefined]
behavior for which this document imposes no requirements
[Note: Undefined behavior may be expected when this document omits
any explicit definition of behavior or when a program uses an erroneous
construct or erroneous data. Permissible undefined behavior ranges
from ignoring the situation completely with unpredictable results, to
behaving during translation or program execution in a documented manner
characteristic of the environment (with or without the issuance of a
diagnostic message), to terminating a translation or execution (with the
issuance of a diagnostic message). Many erroneous program constructs
do not engender undefined behavior; they are required to be diagnosed.
Evaluation of a constant expression never exhibits behavior explicitly
specified as undefined ([expr.const]). — end note]
Emphasis mine.
A program that exhibits undefined behavior becomes useless, as everything
it has done so far and will do further has no meaning if it contains
erroneous data or constructs. With that in mind, do remember that
compilers may completely ignore for the case when undefined behavior
is hit, and this actually is used as discovered facts when optimizing a
program. For instance, a construct like x + 1 > x (where x is a signed integer) will be optimized away to a constant,
true, even if the value of x is unknown at compile-time. The reasoning
is that the compiler wants to optimize for valid cases, and the only
way for that construct to be valid is when it doesn't trigger arithmetic
overflow (i.e. if x != std::numeric_limits<decltype(x)>::max()). This
is a new learned fact in the optimizer. Based on that, the construct is
proven to always evaluate to true.
Note: this same optimization can't occur for unsigned integers, because overflowing one is not UB. That is, the compiler needs to keep the expression as it is, as it might have a different evaluation when overflow occurs (unsigned is module 2N, where N is number of bits). Optimizing it away for unsigned integers would be incompliant with the standard (thanks aschepler).
This is useful as it allows for tons of optimizations to kick
in. So
far, so good, but what happens if x holds its maximum value at runtime?
Well, that is undefined behavior, so it's nonsense to try to reason about
it, as anything may happen and the standard imposes no requirements.
Now we have enough information in order to better examine your faulty
program. We already know that accessing a null pointer is undefined
behavior, and that's what's causing the funny behavior at runtime.
So let's try and understand why Clang (or technically LLVM) optimized
the program the way it did.
static void (*foo)() = nullptr;
static void format_disk()
{
std::puts("formatting hard disk drive!");
}
void never_called()
{
foo = format_disk;
}
int main()
{
foo();
}
Remember that it's possible to call never_called before the main entry
starts executing. For example, when declaring a top-level variable,
you can call it while initializing the value of that variable:
void never_called();
int x = (never_called(), 42);
If you write this snippet in your program, the program no
longer exhibits undefined behavior, and the message "formatting hard
disk drive!" is displayed, with optimizations either on or off.
So what's the only way this program is valid? There's this never_caled
function that assigns the address of format_disk to foo, so we might
find something here. Note that foo is marked as static, which means it
has internal linkage and can't be accessed from outside this translation
unit. In contrast, the function never_called has external linkage, and may
be accessed from outside. If another translation unit contains a snippet
like the one above, then this program becomes valid.
Cool, but there's no one calling never_called from outside. Even though this
is the fact, the optimizer sees that the only way for this program to
be valid is if never_called is called before main executes, otherwise it's
just undefined behavior. That's a new learned fact, so the compiler assumes never_called
is in fact called. Based on that new knowledge, other optimizations that
kick in may take advantage of it.
For instance, when constant
folding is
applied, it sees that the construct foo() is only valid if foo can be properly initialized. The only way for that to happen is if never_called is called outside of this translation unit, so foo = format_disk.
Dead code elimination and interprocedural optimization might find out that if foo == format_disk, then the code inside never_called is unneeded,
so the function's body is transformed into a single ret instruction.
Inline expansion optimization
sees that foo == format_disk, so the call to foo can be replaced
with its body. In the end, we end up with something like this:
never_called():
ret
main:
mov edi, .L.str
call puts
xor eax, eax
ret
.L.str:
.asciz "formatting hard disk drive!"
Which is somewhat equivalent to the output of Clang with optimizations on. Of course, what Clang really did may (and might) be different, but optimizations are nonetheless capable of reaching the same conclusion.
Examining GCC's output with optimizations on, it seems it didn't bother investigating:
.LC0:
.string "formatting hard disk drive!"
format_disk():
mov edi, OFFSET FLAT:.LC0
jmp puts
never_called():
mov QWORD PTR foo[rip], OFFSET FLAT:format_disk()
ret
main:
sub rsp, 8
call [QWORD PTR foo[rip]]
xor eax, eax
add rsp, 8
ret
Executing that program results in a crash (segmentation fault), but if you call never_called in another translation unit before main gets executed, then this program doesn't exhibit undefined behavior anymore.
All of this can change crazily as more and more optimizations are engineered, so do not rely on the assumption that your compiler will take care of code containing undefined behavior, it might just screw you up as well (and format your hard drive for real!)
I recommend you read What every C programmer should know about Undefined Behavior and A Guide to Undefined Behavior in C and C++, both article series are very informative and might help you out with understanding the state of art.

Unless an implementation specifies the effect of trying to invoke a null function pointer, it could behave as a call to arbitrary code. Such arbitrary code could perfectly well behave like a call to function "foo()". While Annex L of the C Standard would invite implementations to distinguish between "Critical UB" and "non-critical UB", and some C++ implementations might apply a similar distinction, a invoking an invalid function pointer would be critical UB in any case.
Note that the situation in this question is very different from e.g.
unsigned short q;
unsigned hey(void)
{
if (q < 50000)
do_something();
return q*q;
}
In the latter situation, a compiler which does not claim to be "analyzable" might recognize that code will invoke if q is greater than 46,340 when execution reaches the return statement, and thus it might as well invoke do_something() unconditionally. While Annex L is badly written, it would seem the intention would be to forbid such "optimizations". In the case of calling an invalid function pointer, however, even straightforwardly-generated code on most platforms might have arbitrary behavior.

Related

How can I make clang not optimize allocation failure check by default

Consider the following function
int demofunc() {
float* ptr = new float[1000];
if (ptr==nullptr)
{
return -1;
}
return 0;
}
When compiled with clang14, produces a working code. Once -O1 or more aggressive is enabled, the if condition gets removed and the output assembly becomes :
demofunc(): # #demofunc()
xor eax, eax
ret
I can declare my pointer as float* volatile and the branch doesn't get removed, but I don't want to handle this case by case.
Now, I've read in the c++ standard that an allocator that does not have the nothrow property signal failure with a bad_alloc exception while one with nothrow property retuns nullptr.
I get that assumption of success even if I specify fno-exceptions which, I think, should not happen.
I need to specify that I am dealing with a custom port of clang for a specific hardware that has no exception support and I do get new that returns nullptr if the heap gets depleted. I am not exactly sure if this is a problem with clang port or clang itself.
The question goes as follow:
Why does the failure check branch gets optimized?
Can I specify a compiler option to ensure that it doesn't get removed, even with -O1 (or more aggressive).

Is C++ compiler obligated to generate 'ret' instruction for a function without 'return' statement?

Recently I encountered a problem that a thread in my programe was somehow stopped (GDB indicates thread's state as STOPPED). I have spent a couple of days debugging. This problem happended after I changed the version of the compiler some days ago. Finally I found out that the problem happens because a function is defined with int return type but does not have a return statement. So the asm code of the function does not have a ret instruction. When it reaches the end of the funcion in runtime, since no ret instruction exists, execution continues on the next instruction which is in another funciton. This causes the thread being stopped. I have checked the asm code generated by the compiler of previous version, and the same function does have a ret instruction.
In my experience, missing return statement in a function with non-void return type is not suggested, but should not results in thead stop. I want to know if I am wrong, or the compiler is wrong.
My question is, for a normally ended (not ended by exception, call, jmp...) function with non-void return type, is the compiler obligated to generate a ret instruction or not. Does it have any C++ standards or C++ specifications illustrating the compiler behavoir in this case?
I have done some search and found an answer for c# in this. How about for C++?

No.
For any non-void function except main, if there is any path through the function that does not return, your program exhibits undefined behaviour if that path is taken at runtime. Then the compiler is allowed to do whatever it feels like. Generally this results in just a missing ret statement but I have seen much weirder effects.
The standard specifies this in [stmt.return].2:
[..] Flowing off the end of a constructor, a destructor, or a
non-coroutine function with a cv void return type is equivalent to a
return with no operand. Otherwise, flowing off the end of a function
other than main (6.9.3.1) or a coroutine (9.5.4) results in undefined
behavior.
(emphasis added)

Should volatile still be used for sharing data with ISRs in modern C++?

I've seen some flavors of these question around and I've seen mixed answers, still unsure whether they are up-to-date and fully apply to my use case, so I'll ask here. Do let me know if it's a duplicate!
Given that I'm developing for STM32 microcontrollers (bare-metal) using C++17 and the gcc-arm-none-eabi-9 toolchain:
Do I still need to use volatile for sharing data between an ISR and main()?
volatile std::int32_t flag = 0;
extern "C" void ISR()
{
flag = 1;
}
int main()
{
while (!flag) { ... }
}
It's clear to me that I should always use volatile for accessing memory-mapped HW registers.
However for the ISR use case I don't know if it can be considered a case of "multithreading" or not. In that case, people recommend using C++11's new threading features (e.g. std::atomic). I'm aware of the difference between volatile (don't optimize) and atomic (safe access), so the answers suggesting std::atomic confuse me here.
For the case of "real" multithreading on x86 systems I haven't seen the need to use volatile.
In other words: can the compiler know that flag can change inside ISR? If not, how can it know it in regular multithreaded applications?
Thanks!

I think that in this case both volatile and atomic will most likely work in practice on the 32 bit ARM. At least in an older version of STM32 tools I saw that in fact the C atomics were implemented using volatile for small types.
Volatile will work because the compiler may not optimize away any access to the variable that appears in the code.
However, the generated code must differ for types that cannot be loaded in a single instruction. If you use a volatile int64_t, the compiler will happily load it in two separate instructions. If the ISR runs between loading the two halves of the variable, you will load half the old value and half the new value.
Unfortunately using atomic<int64_t> may also fail with interrupt service routines if the implementation is not lock free. For Cortex-M, 64-bit accesses are not necessarily lockfree, so atomic should not be relied on without checking the implementation. Depending on the implementation, the system might deadlock if the locking mechanism is not reentrant and the interrupt happens while the lock is held. Since C++17, this can be queried by checking atomic<T>::is_always_lock_free. A specific answer for a specific atomic variable (this may depend on alignment) may be obtained by checking flagA.is_lock_free() since C++11.
So longer data must be protected by a separate mechanism (for example by turning off interrupts around the access and making the variable atomic or volatile.
So the correct way is to use std::atomic, as long as the access is lock free. If you are concerned about performance, it may pay off to select the appropriate memory order and stick to values that can be loaded in a single instruction.
Not using either would be wrong, the compiler will check the flag only once.
These functions all wait for a flag, but they get translated differently:
#include <atomic>
#include <cstdint>
using FlagT = std::int32_t;
volatile FlagT flag = 0;
void waitV()
{
while (!flag) {}
}
std::atomic<FlagT> flagA;
void waitA()
{
while(!flagA) {}
}
void waitRelaxed()
{
while(!flagA.load(std::memory_order_relaxed)) {}
}
FlagT wrongFlag;
void waitWrong()
{
while(!wrongFlag) {}
}
Using volatile you get a loop that reexamines the flag as you wanted:
waitV():
ldr r2, .L5
.L2:
ldr r3, [r2]
cmp r3, #0
beq .L2
bx lr
.L5:
.word .LANCHOR0
Atomic with the default sequentially consistent access produces synchronized access:
waitA():
push {r4, lr}
.L8:
bl __sync_synchronize
ldr r3, .L11
ldr r4, [r3, #4]
bl __sync_synchronize
cmp r4, #0
beq .L8
pop {r4}
pop {r0}
bx r0
.L11:
.word .LANCHOR0
If you do not care about the memory order you get a working loop just as with volatile:
waitRelaxed():
ldr r2, .L17
.L14:
ldr r3, [r2, #4]
cmp r3, #0
beq .L14
bx lr
.L17:
.word .LANCHOR0
Using neither volatile nor atomic will bite you with optimization enabled, as the flag is only checked once:
waitWrong():
ldr r3, .L24
ldr r3, [r3, #8]
cmp r3, #0
bne .L23
.L22: // infinite loop!
b .L22
.L23:
bx lr
.L24:
.word .LANCHOR0
flag:
flagA:
wrongFlag:

To understand the issue, you must first understand why volatile is needed in the first place.
There are three completely separate issues here:
Incorrect optimizations because the compiler doesn't realize that hardware callbacks such as ISRs are actually called.
Solution: volatile or compiler awareness.
Re-entrancy and race condition bugs caused by accessing a variable in several instructions and getting interrupted in the middle of it by an ISR using the same variable.
Solution: protected or atomic access with mutex, _Atomic, disabled interrupts etc.
Parallelism or pre-fetch cache bugs caused by instruction re-ordering, multi-core execution, branch prediction.
Solution: memory barriers or allocation/execution in memory areas that aren't cached. volatile access may or may not act as a memory barrier on some systems.
As soon as someone brings this kind of question up of SO, you always get lots of PC programmers babbling about 2 and 3 without knowing or understanding anything about 1. This is because they have never in their life written an ISR and PC compilers with multi-threading are generally aware that thread callbacks will get executed, so this isn't typically an issue in PC programs.
What you need to do to solve 1) in your case, is to see if the compiler actually generates code for reading while (!flag), with or without optimizations enabled. Disassemble and check.
Ideally, compiler documentation will tell that the compiler understands the meaning of some compiler-specific extension such as the non-standard keyword interrupt and upon spotting it make no assumptions about that function not getting called.
Sadly though, most compilers only use interrupt etc keywords to generate the right calling convention and return instructions. I recently encountered the missing volatile bug just a few weeks ago, upon helping someone on a SE site and they were using a modern ARM tool chain. So I don't trust compilers to handle this still, in the year 2020, unless they explicitly document it. When in doubt use volatile.
Regarding 2) and re-entrancy, modern compilers do support _Atomic nowadays, which makes things very easy. Use it is it's available and reliable on your compiler. Otherwise, for most bare metal systems you can utilize the fact that interrupts are non-interruptable and use a plain bool as a "mutex lite" (example), as long as there is no instruction re-ordering (unlikely case for most MCUs).
But please note that 2) is a separate issue not related to volatile. volatile does not solve thread-safe access. Thread-safe access does not solve incorrect optimizations. So don't mix these two unrelated concepts up in the same mess, as often seen on SO.

Of the commercial compilers I've tested that weren't based on gcc or clang, all of them would treat a read or write via volatile pointer or lvalue as being capable of accessing any other object, without regard for whether it would seem possible for the pointer or lvalue to hit the object in question. Some, such as MSVC, formally documented the fact that volatile writes have release semantics and volatile reads have acquire semantics, while others would require a read/write pair to achieve acquire semantics.
Such semantics make it possible to use volatile objects to build a mutex that can guard "ordinary" objects on systems with a strong memory model (including single-core systems with interrupts), or on compilers that apply acquire/release barriers at the hardware memory ordering level rather than merely the compiler ordering level.
Neither clang or gcc, however, offers any option other than -O0 which would offer such semantics, since they would impede "optimizations" that would otherwise be able to convert code that performs seemingly-redundant loads and stores [that are actually needed for correct operation] into "more efficient" code [that doesn't work]. To make one's code usable with those, I would recommend defining a 'memory clobber' macro (which for clang or gcc would be asm volatile ("" ::: "memory");) and invoking it between the action which needs to precede a volatile write and the write itself, or between a volatile read and the first action which would need to follow it. If one does that, that would allow one's code to be readily adapted to implementations that would neither support nor require such barriers, simply by defining the macro as an empty expansion.
Note that while some compilers interpret all asm directives as a memory clobber, and there wouldn't be any other purpose for an empty asm directive, gcc simply ignores empty asm directives rather than interpreting them in such fashion.
An example of a situation where gcc's optimizations would prove problematic (clang seems to handle this particular case correctly, but some others still pose problems):
short buffer[10];
volatile short volatile *tx_ptr;
volatile int tx_count;
void test(void)
{
buffer[0] = 1;
tx_ptr = buffer;
tx_count = 1;
while(tx_count)
;
buffer[0] = 2;
tx_ptr = buffer;
tx_count = 1;
while(tx_count)
;
}
GCC will decide to optimize out the assignment buffer[0]=1; because the Standard doesn't require it to recognize that storing the buffer's address into a volatile might have side effects that would interact with the value stored there.
[edit: further experimentation shows that icc will reorder accesses to volatile objects, but since it reorders them even with respect to each other, I'm not sure what to make of that, since that would seem broken by any imaginable interpretation of the Standard].

Short answer: always use std::atomic<T> whose is_lock_free() returns true.
Reasoning:
volatile can work reliably on simple architectures (single-core, no cache, ARM/Cortex-M) like STM32F2 or ATSAMG55 with e.g. IAR compiler. But...
It may fail to work as expected on more complex architectures (multi-core with cache) and when compiler tries to do certain optimisations (many examples in other answers, won't repeat that).
atomic_flag and atomic_int (if is_lock_free() which they should) are safe to use anywhere, because they work like volatile with added memory bariers / synchronization when needed (avoiding the problems in previous point).
The reason I specifically said you have to only use those with is_lock_free() being true is because you cannot stop IRQ as you could stop a thread. No, IRQ interrupts main loop and does its job, it cannot wait-lock on a mutex because it is blocking the main loop it would be waiting for.
Practical note: I personally either use atomic_flag (the one and only guaranteed to work) to implement sort of spin-lock, where ISR will disable itself when finding the lock locked, while main loop will always re-enable the ISR after unlocking. Or I use double-counter lock-free queue (SPSC - single producer, single consumer) using that atomit_int. (Have one reader-counter and one writer-counter, subtract to find the real count. Good for UART etc.)

What happens in assembly language when you call a method/function?

If I have a program in C++/C that (language doesn't matter much, just needed to illustrate a concept):
#include <iostream>
void foo() {
printf("in foo");
}
int main() {
foo();
return 0;
}
What happens in the assembly? I'm not actually looking for assembly code as I haven't gotten that far in it yet, but what's the basic principle?

In general, this is what happens:
Arguments to the function are stored on the stack. In platform specific order.
Location for return value is "allocated" on the stack
The return address for the function is also stored in the stack or in a special purpose CPU register.
The function (or actually, the address of the function) is called, either through a CPU specific call instruction or through a normal jmp or br instruction (jump/branch)
The function reads the arguments (if any) from the stack and the runs the function code
Return value from function is stored in the specified location (stack or special purpose CPU register)
Execution jumps back to the caller and the stack is cleared (by restoring the stack pointer to its initial value).
The details of the above vary from platform to platform and even from compiler to compiler (see e.g. STDCALL vs CDECL calling conventions). For instance, in some cases, CPU registers are used instead of storing stuff on the stack. The general idea is the same though

You can see it for yourself:
Under Linux 'compile' your program with:
gcc -S myprogram.c
And you'll get a listing of the programm in assembler (myprogram.s).
Of course you should know a little bit about assembler to understand it (but it's worth learning because it helps to understand how your computer works). Calling a function (on x86 architecture) is basically:
put variable a on stack
put variable b on stack
put variable n on stack
jump to address of the function
load variables from stack
do stuff in function
clean stack
jump back to main

What happens in the assembly?
A brief explanation: The current stack state is saved, a new stack is created and the code for the function to be executed is loaded and run. This involves inconveniencing a few registers of your microprocessor, some frantic to and fro read/writes to the memory and once done, the calling function's stack state is restored.

What happens? In x86, the first line of your main function might look something like:
call foo
The call instruction will push the return address on the stack and then jmp to the location of foo.

Arguments are pushed in stack and "call" instruction is made
Call is a simple "jmp" with pushing an address of instruction into stack ("ret" in the end of a method popping it and jumping on it)

I think you want to take a look at call stack to get a better idea what happens during a function call: http://en.wikipedia.org/wiki/Call_stack

A very good illustration:
http://www.cs.uleth.ca/~holzmann/C/system/memorylayout.pdf

What happens?
C mimics what will occur in assembly...
It is so close to machine that you can realize what will occur
void foo() {
printf("in foo");
/*
db mystring 'in foo'
mov eax, dword ptr mystring
mov edx , dword ptr _printf
push eax
call edx
add esp, 8
ret
//thats it
*/
}
int main() {
foo();
return 0;
}

1- a calling context is established on the stack
2- parameters are pushed on the stack
3- a "call" is performed to the method

The general idea is that you need to
Save the current local state
Pass the arguments to a function
Call the actual function. This involves putting the return address somewhere so the RET instruction knows where to continue.
The specifics vary from architecture to architecture. And the even more specific specifics might vary between various languages. Although there usually are ways of controlling this to some extent to allow for interoperability between different languages.
A pretty useful starting point is the Wikipedia article on calling conventions. On x86 for example the stack is almost always used for passing arguments to functions. On many RISC architectures, however, registers are mainly used while the stack is only needed in exceptional cases.

The common idea is that the registers that are used in the calling method are pushed on the stack (stack pointer is in ESP register), this process is called "push the registers". Sometimes they're also zeroed, but that depends. Assembly programmers tend to free more registers then the common 4 (EAX, EBX, ECX and EDX on x86) to have more possibilities within the function.
When the function ends, the same happens in the reverse: the stack is restored to the state from before calling. This is called "popping the registers".
Update: this process does not necessarily have to happen. Compilers can optimize it away and inline your functions.
Update: normally parameters of the function are pushed on the stack in reverse order, when they are retrieved from the stack, they appear as if in normal order. This order is not guaranteed by C. (ref: Inner Loops by Rick Booth)

Why does volatile exist?

What does the volatile keyword do? In C++ what problem does it solve?
In my case, I have never knowingly needed it.

volatile is needed if you are reading from a spot in memory that, say, a completely separate process/device/whatever may write to.
I used to work with dual-port ram in a multiprocessor system in straight C. We used a hardware managed 16 bit value as a semaphore to know when the other guy was done. Essentially we did this:
void waitForSemaphore()
{
volatile uint16_t* semPtr = WELL_KNOWN_SEM_ADDR;/*well known address to my semaphore*/
while ((*semPtr) != IS_OK_FOR_ME_TO_PROCEED);
}
Without volatile, the optimizer sees the loop as useless (The guy never sets the value! He's nuts, get rid of that code!) and my code would proceed without having acquired the semaphore, causing problems later on.

volatile is needed when developing embedded systems or device drivers, where you need to read or write a memory-mapped hardware device. The contents of a particular device register could change at any time, so you need the volatile keyword to ensure that such accesses aren't optimised away by the compiler.

Some processors have floating point registers that have more than 64 bits of precision (eg. 32-bit x86 without SSE, see Peter's comment). That way, if you run several operations on double-precision numbers, you actually get a higher-precision answer than if you were to truncate each intermediate result to 64 bits.
This is usually great, but it means that depending on how the compiler assigned registers and did optimizations you'll have different results for the exact same operations on the exact same inputs. If you need consistency then you can force each operation to go back to memory by using the volatile keyword.
It's also useful for some algorithms that make no algebraic sense but reduce floating point error, such as Kahan summation. Algebraicly it's a nop, so it will often get incorrectly optimized out unless some intermediate variables are volatile.

From a "Volatile as a promise" article by Dan Saks:
(...) a volatile object is one whose value might change spontaneously. That is, when you declare an object to be volatile, you're telling the compiler that the object might change state even though no statements in the program appear to change it."
Here are links to three of his articles regarding the volatile keyword:
Use volatile judiciously
Place volatile accurately
Volatile as a promise

You MUST use volatile when implementing lock-free data structures. Otherwise the compiler is free to optimize access to the variable, which will change the semantics.
To put it another way, volatile tells the compiler that accesses to this variable must correspond to a physical memory read/write operation.
For example, this is how InterlockedIncrement is declared in the Win32 API:
LONG __cdecl InterlockedIncrement(
__inout LONG volatile *Addend
);

A large application that I used to work on in the early 1990s contained C-based exception handling using setjmp and longjmp. The volatile keyword was necessary on variables whose values needed to be preserved in the block of code that served as the "catch" clause, lest those vars be stored in registers and wiped out by the longjmp.

In Standard C, one of the places to use volatile is with a signal handler. In fact, in Standard C, all you can safely do in a signal handler is modify a volatile sig_atomic_t variable, or exit quickly. Indeed, AFAIK, it is the only place in Standard C that the use of volatile is required to avoid undefined behaviour.
ISO/IEC 9899:2011 §7.14.1.1 The signal function
¶5 If the signal occurs other than as the result of calling the abort or raise function, the
behavior is undefined if the signal handler refers to any object with static or thread
storage duration that is not a lock-free atomic object other than by assigning a value to an
object declared as volatile sig_atomic_t, or the signal handler calls any function
in the standard library other than the abort function, the _Exit function, the
quick_exit function, or the signal function with the first argument equal to the
signal number corresponding to the signal that caused the invocation of the handler.
Furthermore, if such a call to the signal function results in a SIG_ERR return, the
value of errno is indeterminate.252)
252) If any signal is generated by an asynchronous signal handler, the behavior is undefined.
That means that in Standard C, you can write:
static volatile sig_atomic_t sig_num = 0;
static void sig_handler(int signum)
{
signal(signum, sig_handler);
sig_num = signum;
}
and not much else.
POSIX is a lot more lenient about what you can do in a signal handler, but there are still limitations (and one of the limitations is that the Standard I/O library — printf() et al — cannot be used safely).

Developing for an embedded, I have a loop that checks on a variable that can be changed in an interrupt handler. Without "volatile", the loop becomes a noop - as far as the compiler can tell, the variable never changes, so it optimizes the check away.
Same thing would apply to a variable that may be changed in a different thread in a more traditional environment, but there we often do synchronization calls, so compiler is not so free with optimization.

I've used it in debug builds when the compiler insists on optimizing away a variable that I want to be able to see as I step through code.

Besides using it as intended, volatile is used in (template) metaprogramming. It can be used to prevent accidental overloading, as the volatile attribute (like const) takes part in overload resolution.
template <typename T>
class Foo {
std::enable_if_t<sizeof(T)==4, void> f(T& t)
{ std::cout << 1 << t; }
void f(T volatile& t)
{ std::cout << 2 << const_cast<T&>(t); }
void bar() { T t; f(t); }
};
This is legal; both overloads are potentially callable and do almost the same. The cast in the volatile overload is legal as we know bar won't pass a non-volatile T anyway. The volatile version is strictly worse, though, so never chosen in overload resolution if the non-volatile f is available.
Note that the code never actually depends on volatile memory access.

you must use it to implement spinlocks as well as some (all?) lock-free data structures
use it with atomic operations/instructions
helped me once to overcome compiler's bug (wrongly generated code during optimization)

The volatile keyword is intended to prevent the compiler from applying any optimisations on objects that can change in ways that cannot be determined by the compiler.
Objects declared as volatile are omitted from optimisation because their values can be changed by code outside the scope of current code at any time. The system always reads the current value of a volatile object from the memory location rather than keeping its value in temporary register at the point it is requested, even if a previous instruction asked for a value from the same object.
Consider the following cases
1) Global variables modified by an interrupt service routine outside the scope.
2) Global variables within a multi-threaded application.
If we do not use volatile qualifier, the following problems may arise
1) Code may not work as expected when optimisation is turned on.
2) Code may not work as expected when interrupts are enabled and used.
Volatile: A programmer’s best friend
https://en.wikipedia.org/wiki/Volatile_(computer_programming)

Other answers already mention avoiding some optimization in order to:
use memory mapped registers (or "MMIO")
write device drivers
allow easier debugging of programs
make floating point computations more deterministic
Volatile is essential whenever you need a value to appear to come from the outside and be unpredictable and avoid compiler optimizations based on a value being known, and when a result isn't actually used but you need it to be computed, or it's used but you want to compute it several times for a benchmark, and you need the computations to start and end at precise points.
A volatile read is like an input operation (like scanf or a use of cin): the value seems to come from the outside of the program, so any computation that has a dependency on the value needs to start after it.
A volatile write is like an output operation (like printf or a use of cout): the value seems to be communicated outside of the program, so if the value depends on a computation, it needs to be finished before.
So a pair of volatile read/write can be used to tame benchmarks and make time measurement meaningful.
Without volatile, your computation could be started by the compiler before, as nothing would prevent reordering of computations with functions such as time measurement.

All answers are excellent. But on the top of that, I would like to share an example.
Below is a little cpp program:
#include <iostream>
int x;
int main(){
char buf[50];
x = 8;
if(x == 8)
printf("x is 8\n");
else
sprintf(buf, "x is not 8\n");
x=1000;
while(x > 5)
x--;
return 0;
}
Now, lets generate the assembly of the above code (and I will paste only that portions of the assembly which relevant here):
The command to generate assembly:
g++ -S -O3 -c -fverbose-asm -Wa,-adhln assembly.cpp
And the assembly:
main:
.LFB1594:
subq $40, %rsp #,
.seh_stackalloc 40
.seh_endprologue
# assembly.cpp:5: int main(){
call __main #
# assembly.cpp:10: printf("x is 8\n");
leaq .LC0(%rip), %rcx #,
# assembly.cpp:7: x = 8;
movl $8, x(%rip) #, x
# assembly.cpp:10: printf("x is 8\n");
call _ZL6printfPKcz.constprop.0 #
# assembly.cpp:18: }
xorl %eax, %eax #
movl $5, x(%rip) #, x
addq $40, %rsp #,
ret
.seh_endproc
.p2align 4,,15
.def _GLOBAL__sub_I_x; .scl 3; .type 32; .endef
.seh_proc _GLOBAL__sub_I_x
You can see in the assembly that the assembly code was not generated for sprintf because the compiler assumed that x will not change outside of the program. And same is the case with the while loop. while loop was altogether removed due to the optimization because compiler saw it as a useless code and thus directly assigned 5 to x (see movl $5, x(%rip)).
The problem occurs when what if an external process/ hardware would change the value of x somewhere between x = 8; and if(x == 8). We would expect else block to work but unfortunately the compiler has trimmed out that part.
Now, in order to solve this, in the assembly.cpp, let us change int x; to volatile int x; and quickly see the assembly code generated:
main:
.LFB1594:
subq $104, %rsp #,
.seh_stackalloc 104
.seh_endprologue
# assembly.cpp:5: int main(){
call __main #
# assembly.cpp:7: x = 8;
movl $8, x(%rip) #, x
# assembly.cpp:9: if(x == 8)
movl x(%rip), %eax # x, x.1_1
# assembly.cpp:9: if(x == 8)
cmpl $8, %eax #, x.1_1
je .L11 #,
# assembly.cpp:12: sprintf(buf, "x is not 8\n");
leaq 32(%rsp), %rcx #, tmp93
leaq .LC0(%rip), %rdx #,
call _ZL7sprintfPcPKcz.constprop.0 #
.L7:
# assembly.cpp:14: x=1000;
movl $1000, x(%rip) #, x
# assembly.cpp:15: while(x > 5)
movl x(%rip), %eax # x, x.3_15
cmpl $5, %eax #, x.3_15
jle .L8 #,
.p2align 4,,10
.L9:
# assembly.cpp:16: x--;
movl x(%rip), %eax # x, x.4_3
subl $1, %eax #, _4
movl %eax, x(%rip) # _4, x
# assembly.cpp:15: while(x > 5)
movl x(%rip), %eax # x, x.3_2
cmpl $5, %eax #, x.3_2
jg .L9 #,
.L8:
# assembly.cpp:18: }
xorl %eax, %eax #
addq $104, %rsp #,
ret
.L11:
# assembly.cpp:10: printf("x is 8\n");
leaq .LC1(%rip), %rcx #,
call _ZL6printfPKcz.constprop.1 #
jmp .L7 #
.seh_endproc
.p2align 4,,15
.def _GLOBAL__sub_I_x; .scl 3; .type 32; .endef
.seh_proc _GLOBAL__sub_I_x
Here you can see that the assembly codes for sprintf, printf and while loop were generated. The advantage is that if the x variable is changed by some external program or hardware, sprintf part of the code will be executed. And similarly while loop can be used for busy waiting now.

Beside the fact that the volatile keyword is used for telling the compiler not to optimize the access to some variable (that can be modified by a thread or an interrupt routine), it can be also used to remove some compiler bugs -- YES it can be ---.
For example I worked on an embedded platform were the compiler was making some wrong assuptions regarding a value of a variable. If the code wasn't optimized the program would run ok. With optimizations (which were really needed because it was a critical routine) the code wouldn't work correctly. The only solution (though not very correct) was to declare the 'faulty' variable as volatile.

Your program seems to work even without volatile keyword? Perhaps this is the reason:
As mentioned previously the volatile keyword helps for cases like
volatile int* p = ...; // point to some memory
while( *p!=0 ) {} // loop until the memory becomes zero
But there seems to be almost no effect once an external or non-inline function is being called. E.g.:
while( *p!=0 ) { g(); }
Then with or without volatile almost the same result is generated.
As long as g() can be completely inlined, the compiler can see everything that's going on and can therefore optimize. But when the program makes a call to a place where the compiler can't see what's going on, it isn't safe for the compiler to make any assumptions any more. Hence the compiler will generate code that always reads from memory directly.
But beware of the day, when your function g() becomes inline (either due to explicit changes or due to compiler/linker cleverness) then your code might break if you forgot the volatile keyword!
Therefore I recommend to add the volatile keyword even if your program seems to work without. It makes the intention clearer and more robust in respect to future changes.

In the early days of C, compilers would interpret all actions that read and write lvalues as memory operations, to be performed in the same sequence as the reads and writes appeared in the code. Efficiency could be greatly improved in many cases if compilers were given a certain amount of freedom to re-order and consolidate operations, but there was a problem with this. Even though operations were often specified in a certain order merely because it was necessary to specify them in some order, and thus the programmer picked one of many equally-good alternatives, that wasn't always the case. Sometimes it would be important that certain operations occur in a particular sequence.
Exactly which details of sequencing are important will vary depending upon the target platform and application field. Rather than provide particularly detailed control, the Standard opted for a simple model: if a sequence of accesses are done with lvalues that are not qualified volatile, a compiler may reorder and consolidate them as it sees fit. If an action is done with a volatile-qualified lvalue, a quality implementation should offer whatever additional ordering guarantees might be required by code targeting its intended platform and application field, without requiring that programmers use non-standard syntax.
Unfortunately, rather than identify what guarantees programmers would need, many compilers have opted instead to offer the bare minimum guarantees mandated by the Standard. This makes volatile much less useful than it should be. On gcc or clang, for example, a programmer needing to implement a basic "hand-off mutex" [one where a task that has acquired and released a mutex won't do so again until the other task has done so] must do one of four things:
Put the acquisition and release of the mutex in a function that the compiler cannot inline, and to which it cannot apply Whole Program Optimization.
Qualify all the objects guarded by the mutex as volatile--something which shouldn't be necessary if all accesses occur after acquiring the mutex and before releasing it.
Use optimization level 0 to force the compiler to generate code as though all objects that aren't qualified register are volatile.
Use gcc-specific directives.
By contrast, when using a higher-quality compiler which is more suitable for systems programming, such as icc, one would have another option:
Make sure that a volatile-qualified write gets performed everyplace an acquire or release is needed.
Acquiring a basic "hand-off mutex" requires a volatile read (to see if it's ready), and shouldn't require a volatile write as well (the other side won't try to re-acquire it until it's handed back) but having to perform a meaningless volatile write is still better than any of the options available under gcc or clang.

One use I should remind you is, in the signal handler function, if you want to access/modify a global variable (for example, mark it as exit = true) you have to declare that variable as 'volatile'.

I would like to quote Herb Sutter's words from his GotW #95, which can help to understand the meaning of the volatile variables:
C++ volatile variables (which have no analog in languages like C# and Java) are always beyond the scope of this and any other article about the memory model and synchronization. That’s because C++ volatile variables aren’t about threads or communication at all and don’t interact with those things. Rather, a C++ volatile variable should be viewed as portal into a different universe beyond the language — a memory location that by definition does not obey the language’s memory model because that memory location is accessed by hardware (e.g., written to by a daughter card), have more than one address, or is otherwise “strange” and beyond the language. So C++ volatile variables are universally an exception to every guideline about synchronization because are always inherently “racy” and unsynchronizable using the normal tools (mutexes, atomics, etc.) and more generally exist outside all normal of the language and compiler including that they generally cannot be optimized by the compiler (because the compiler isn’t allowed to know their semantics; a volatile int vi; may not behave anything like a normal int, and you can’t even assume that code like vi = 5; int read_back = vi; is guaranteed to result in read_back == 5, or that code like int i = vi; int j = vi; that reads vi twice will result in i == j which will not be true if vi is a hardware counter for example).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js