Why does volatile exist? - c++

What does the volatile keyword do? In C++ what problem does it solve?
In my case, I have never knowingly needed it.

volatile is needed if you are reading from a spot in memory that, say, a completely separate process/device/whatever may write to.
I used to work with dual-port ram in a multiprocessor system in straight C. We used a hardware managed 16 bit value as a semaphore to know when the other guy was done. Essentially we did this:
void waitForSemaphore()
{
   volatile uint16_t* semPtr = WELL_KNOWN_SEM_ADDR; /* well known address to my semaphore */
   while ((*semPtr) != IS_OK_FOR_ME_TO_PROCEED)
       ;
}
Without volatile, the optimizer sees the loop as useless (The guy never sets the value! He's nuts, get rid of that code!) and my code would proceed without having acquired the semaphore, causing problems later on.

volatile is needed when developing embedded systems or device drivers, where you need to read or write a memory-mapped hardware device. The contents of a particular device register could change at any time, so you need the volatile keyword to ensure that such accesses aren't optimised away by the compiler.

Some processors have floating point registers that are wider than a 64-bit double (e.g. the 80-bit x87 registers on 32-bit x86 without SSE, see Peter's comment). That way, if you run several operations on double-precision numbers, you actually get a higher-precision answer than if you were to truncate each intermediate result to 64 bits.
This is usually great, but it means that depending on how the compiler assigned registers and did optimizations you'll have different results for the exact same operations on the exact same inputs. If you need consistency then you can force each operation to go back to memory by using the volatile keyword.
It's also useful for some algorithms that make no algebraic sense but reduce floating point error, such as Kahan summation. Algebraically it's a no-op, so it will often get incorrectly optimized out unless some intermediate variables are volatile.
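To illustrate the Kahan summation point, here is a minimal sketch (my own example, not from the original answer) in which the intermediate variables are volatile so the algebraically-redundant compensation steps cannot be optimized away:

#include <cstddef>

double kahan_sum(const double* data, std::size_t n)
{
    double sum = 0.0;
    volatile double c = 0.0;          // running compensation for lost low-order bits
    for (std::size_t i = 0; i < n; ++i) {
        double y = data[i] - c;       // subtract the previous compensation
        volatile double t = sum + y;  // low-order bits of y are lost here
        c = (t - sum) - y;            // recover what was lost
        sum = t;
    }
    return sum;
}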

From a "Volatile as a promise" article by Dan Saks:
(...) a volatile object is one whose value might change spontaneously. That is, when you declare an object to be volatile, you're telling the compiler that the object might change state even though no statements in the program appear to change it.
Here are links to three of his articles regarding the volatile keyword:
Use volatile judiciously
Place volatile accurately
Volatile as a promise

You MUST use volatile when implementing lock-free data structures. Otherwise the compiler is free to optimize access to the variable, which will change the semantics.
To put it another way, volatile tells the compiler that accesses to this variable must correspond to a physical memory read/write operation.
For example, this is how InterlockedIncrement is declared in the Win32 API:
LONG __cdecl InterlockedIncrement(
__inout LONG volatile *Addend
);

A large application that I used to work on in the early 1990s contained C-based exception handling using setjmp and longjmp. The volatile keyword was necessary on variables whose values needed to be preserved in the block of code that served as the "catch" clause, lest those vars be stored in registers and wiped out by the longjmp.
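A minimal sketch of that setjmp/longjmp situation (my own illustration, not code from that application): a local variable modified between setjmp and longjmp must be volatile, otherwise its value in the "catch" path is indeterminate because it may have lived in a register that the longjmp wipes out.

#include <csetjmp>
#include <cstdio>

static std::jmp_buf env;

static void do_work()
{
    std::longjmp(env, 1);       // the "throw": unwinds back to the setjmp call
}

int main()
{
    volatile int progress = 0;  // modified between setjmp and longjmp

    if (setjmp(env) == 0) {     // the "try" block
        progress = 42;
        do_work();
    } else {                    // the "catch" block
        std::printf("progress was %d\n", progress);  // guaranteed to be 42 only because of volatile
    }
}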

In Standard C, one of the places to use volatile is with a signal handler. In fact, in Standard C, all you can safely do in a signal handler is modify a volatile sig_atomic_t variable, or exit quickly. Indeed, AFAIK, it is the only place in Standard C that the use of volatile is required to avoid undefined behaviour.
ISO/IEC 9899:2011 §7.14.1.1 The signal function
¶5 If the signal occurs other than as the result of calling the abort or raise function, the
behavior is undefined if the signal handler refers to any object with static or thread
storage duration that is not a lock-free atomic object other than by assigning a value to an
object declared as volatile sig_atomic_t, or the signal handler calls any function
in the standard library other than the abort function, the _Exit function, the
quick_exit function, or the signal function with the first argument equal to the
signal number corresponding to the signal that caused the invocation of the handler.
Furthermore, if such a call to the signal function results in a SIG_ERR return, the
value of errno is indeterminate.252)
252) If any signal is generated by an asynchronous signal handler, the behavior is undefined.
That means that in Standard C, you can write:
#include <signal.h>

static volatile sig_atomic_t sig_num = 0;

static void sig_handler(int signum)
{
    signal(signum, sig_handler);
    sig_num = signum;
}
and not much else.
POSIX is a lot more lenient about what you can do in a signal handler, but there are still limitations (and one of the limitations is that the Standard I/O library — printf() et al — cannot be used safely).

Developing for an embedded platform, I have a loop that checks on a variable that can be changed in an interrupt handler. Without "volatile", the loop becomes a no-op - as far as the compiler can tell, the variable never changes, so it optimizes the check away.
The same thing would apply to a variable that may be changed in a different thread in a more traditional environment, but there we often make synchronization calls, so the compiler is not so free with optimization.

I've used it in debug builds when the compiler insists on optimizing away a variable that I want to be able to see as I step through code.

Besides using it as intended, volatile is used in (template) metaprogramming. It can be used to prevent accidental overloading, as the volatile attribute (like const) takes part in overload resolution.
#include <iostream>
#include <type_traits>

template <typename T>
class Foo {
public:
    std::enable_if_t<sizeof(T) == 4, void> f(T& t)
    { std::cout << 1 << t; }
    void f(T volatile& t)
    { std::cout << 2 << const_cast<T&>(t); }
    void bar() { T t{}; f(t); }
};
This is legal; both overloads are potentially callable and do almost the same. The cast in the volatile overload is legal as we know bar won't pass a non-volatile T anyway. The volatile version is strictly worse, though, so never chosen in overload resolution if the non-volatile f is available.
Note that the code never actually depends on volatile memory access.

You must use it to implement spinlocks as well as some (all?) lock-free data structures
Use it with atomic operations/instructions
It helped me once to overcome a compiler bug (wrongly generated code during optimization)

The volatile keyword is intended to prevent the compiler from applying any optimisations on objects that can change in ways that cannot be determined by the compiler.
Objects declared as volatile are omitted from optimisation because their values can be changed by code outside the scope of the current code at any time. The system always reads the current value of a volatile object from its memory location rather than keeping the value in a temporary register at the point it is requested, even if a previous instruction asked for the value of the same object.
Consider the following cases
1) Global variables modified by an interrupt service routine outside the scope.
2) Global variables within a multi-threaded application.
If we do not use volatile qualifier, the following problems may arise
1) Code may not work as expected when optimisation is turned on.
2) Code may not work as expected when interrupts are enabled and used.
Volatile: A programmer’s best friend
https://en.wikipedia.org/wiki/Volatile_(computer_programming)

Other answers already mention avoiding some optimization in order to:
use memory mapped registers (or "MMIO")
write device drivers
allow easier debugging of programs
make floating point computations more deterministic
Volatile is essential whenever you need a value to appear to come from the outside and be unpredictable, so that compiler optimizations based on the value being known are avoided. It is also needed when a result isn't actually used but you need it to be computed, or when a result is used but you want to compute it several times for a benchmark, and you need the computations to start and end at precise points.
A volatile read is like an input operation (like scanf or a use of cin): the value seems to come from the outside of the program, so any computation that has a dependency on the value needs to start after it.
A volatile write is like an output operation (like printf or a use of cout): the value seems to be communicated outside of the program, so if the value depends on a computation, it needs to be finished before.
So a pair of volatile read/write can be used to tame benchmarks and make time measurement meaningful.
Without volatile, the compiler could start your computation earlier, as nothing would prevent it from reordering the computation with respect to functions such as the time measurement.
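For example, here is a rough sketch of how a pair of volatile accesses can pin down what a benchmark actually measures (the helper names and the work function are illustrative, not from the answer above):

#include <chrono>
#include <cstdio>

volatile int input_port = 123;   // volatile read: the value looks unpredictable
volatile int output_port;        // volatile write: the result looks used

int expensive(int x) { return x * x + 1; }   // stand-in for the real work

int main()
{
    int x = input_port;          // cannot be constant-folded away
    auto t0 = std::chrono::steady_clock::now();
    int r = expensive(x);
    output_port = r;             // cannot be removed as dead code
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%lld ns\n", static_cast<long long>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()));
}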

All the answers are excellent, but on top of that I would like to share an example.
Below is a little cpp program:
#include <cstdio>

int x;

int main(){
    char buf[50];
    x = 8;

    if(x == 8)
        printf("x is 8\n");
    else
        sprintf(buf, "x is not 8\n");

    x=1000;
    while(x > 5)
        x--;
    return 0;
}
Now, let's generate the assembly of the above code (I will paste only the portions of the assembly which are relevant here):
The command to generate assembly:
g++ -S -O3 -c -fverbose-asm -Wa,-adhln assembly.cpp
And the assembly:
main:
.LFB1594:
subq $40, %rsp #,
.seh_stackalloc 40
.seh_endprologue
# assembly.cpp:5: int main(){
call __main #
# assembly.cpp:10: printf("x is 8\n");
leaq .LC0(%rip), %rcx #,
# assembly.cpp:7: x = 8;
movl $8, x(%rip) #, x
# assembly.cpp:10: printf("x is 8\n");
call _ZL6printfPKcz.constprop.0 #
# assembly.cpp:18: }
xorl %eax, %eax #
movl $5, x(%rip) #, x
addq $40, %rsp #,
ret
.seh_endproc
.p2align 4,,15
.def _GLOBAL__sub_I_x; .scl 3; .type 32; .endef
.seh_proc _GLOBAL__sub_I_x
You can see in the assembly that no code was generated for sprintf, because the compiler assumed that x will not change outside of the program. The same is the case with the while loop: it was removed altogether by the optimization because the compiler saw it as useless code and thus directly assigned 5 to x (see movl $5, x(%rip)).
The problem occurs if an external process or hardware were to change the value of x somewhere between x = 8; and if(x == 8). We would expect the else block to execute, but unfortunately the compiler has trimmed out that part.
Now, in order to solve this, in the assembly.cpp, let us change int x; to volatile int x; and quickly see the assembly code generated:
main:
.LFB1594:
subq $104, %rsp #,
.seh_stackalloc 104
.seh_endprologue
# assembly.cpp:5: int main(){
call __main #
# assembly.cpp:7: x = 8;
movl $8, x(%rip) #, x
# assembly.cpp:9: if(x == 8)
movl x(%rip), %eax # x, x.1_1
# assembly.cpp:9: if(x == 8)
cmpl $8, %eax #, x.1_1
je .L11 #,
# assembly.cpp:12: sprintf(buf, "x is not 8\n");
leaq 32(%rsp), %rcx #, tmp93
leaq .LC0(%rip), %rdx #,
call _ZL7sprintfPcPKcz.constprop.0 #
.L7:
# assembly.cpp:14: x=1000;
movl $1000, x(%rip) #, x
# assembly.cpp:15: while(x > 5)
movl x(%rip), %eax # x, x.3_15
cmpl $5, %eax #, x.3_15
jle .L8 #,
.p2align 4,,10
.L9:
# assembly.cpp:16: x--;
movl x(%rip), %eax # x, x.4_3
subl $1, %eax #, _4
movl %eax, x(%rip) # _4, x
# assembly.cpp:15: while(x > 5)
movl x(%rip), %eax # x, x.3_2
cmpl $5, %eax #, x.3_2
jg .L9 #,
.L8:
# assembly.cpp:18: }
xorl %eax, %eax #
addq $104, %rsp #,
ret
.L11:
# assembly.cpp:10: printf("x is 8\n");
leaq .LC1(%rip), %rcx #,
call _ZL6printfPKcz.constprop.1 #
jmp .L7 #
.seh_endproc
.p2align 4,,15
.def _GLOBAL__sub_I_x; .scl 3; .type 32; .endef
.seh_proc _GLOBAL__sub_I_x
Here you can see that assembly code was generated for sprintf, printf and the while loop. The advantage is that if the x variable is changed by some external program or hardware, the sprintf part of the code will be executed. And similarly, the while loop can now be used for busy waiting.

Besides the fact that the volatile keyword is used for telling the compiler not to optimize access to some variable (that can be modified by a thread or an interrupt routine), it can also be used to work around some compiler bugs -- yes it can.
For example, I worked on an embedded platform where the compiler was making some wrong assumptions regarding the value of a variable. If the code wasn't optimized, the program would run OK. With optimizations (which were really needed because it was a critical routine) the code wouldn't work correctly. The only solution (though not very correct) was to declare the 'faulty' variable as volatile.

Your program seems to work even without the volatile keyword? Perhaps this is the reason:
As mentioned previously the volatile keyword helps for cases like
volatile int* p = ...; // point to some memory
while( *p!=0 ) {} // loop until the memory becomes zero
But there seems to be almost no effect once an external or non-inline function is being called. E.g.:
while( *p!=0 ) { g(); }
Then with or without volatile almost the same result is generated.
As long as g() can be completely inlined, the compiler can see everything that's going on and can therefore optimize. But when the program makes a call to a place where the compiler can't see what's going on, it isn't safe for the compiler to make any assumptions any more. Hence the compiler will generate code that always reads from memory directly.
But beware of the day, when your function g() becomes inline (either due to explicit changes or due to compiler/linker cleverness) then your code might break if you forgot the volatile keyword!
Therefore I recommend adding the volatile keyword even if your program seems to work without it. It makes the intention clearer and more robust with respect to future changes.

In the early days of C, compilers would interpret all actions that read and write lvalues as memory operations, to be performed in the same sequence as the reads and writes appeared in the code. Efficiency could be greatly improved in many cases if compilers were given a certain amount of freedom to re-order and consolidate operations, but there was a problem with this. Even though operations were often specified in a certain order merely because it was necessary to specify them in some order, and thus the programmer picked one of many equally-good alternatives, that wasn't always the case. Sometimes it would be important that certain operations occur in a particular sequence.
Exactly which details of sequencing are important will vary depending upon the target platform and application field. Rather than provide particularly detailed control, the Standard opted for a simple model: if a sequence of accesses are done with lvalues that are not qualified volatile, a compiler may reorder and consolidate them as it sees fit. If an action is done with a volatile-qualified lvalue, a quality implementation should offer whatever additional ordering guarantees might be required by code targeting its intended platform and application field, without requiring that programmers use non-standard syntax.
Unfortunately, rather than identify what guarantees programmers would need, many compilers have opted instead to offer the bare minimum guarantees mandated by the Standard. This makes volatile much less useful than it should be. On gcc or clang, for example, a programmer needing to implement a basic "hand-off mutex" [one where a task that has acquired and released a mutex won't do so again until the other task has done so] must do one of four things:
Put the acquisition and release of the mutex in a function that the compiler cannot inline, and to which it cannot apply Whole Program Optimization.
Qualify all the objects guarded by the mutex as volatile--something which shouldn't be necessary if all accesses occur after acquiring the mutex and before releasing it.
Use optimization level 0 to force the compiler to generate code as though all objects that aren't qualified register are volatile.
Use gcc-specific directives.
By contrast, when using a higher-quality compiler which is more suitable for systems programming, such as icc, one would have another option:
Make sure that a volatile-qualified write gets performed everyplace an acquire or release is needed.
Acquiring a basic "hand-off mutex" requires a volatile read (to see if it's ready), and shouldn't require a volatile write as well (the other side won't try to re-acquire it until it's handed back) but having to perform a meaningless volatile write is still better than any of the options available under gcc or clang.
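A rough sketch of such a hand-off mutex (my own illustration; as described above, it only works on implementations that give volatile accesses acquire/release-like ordering with respect to the guarded ordinary accesses, which gcc and clang do not promise):

volatile int owned_by_b = 0;   // 0: side A owns the shared data, 1: side B owns it
int shared_data;               // "ordinary" object guarded by the hand-off

void side_a_step(void)
{
    while (owned_by_b)         // volatile read: wait until B hands the data back
        ;
    shared_data += 1;          // guarded access; ideally needs no volatile itself
    owned_by_b = 1;            // volatile write: hand the data off to B
}

void side_b_step(void)
{
    while (!owned_by_b)        // wait for the hand-off
        ;
    shared_data *= 2;          // guarded access
    owned_by_b = 0;            // hand the data back to A
}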

One use I should remind you of: in a signal handler function, if you want to access/modify a global variable (for example, to mark it as exit = true) you have to declare that variable as 'volatile' (more precisely, as volatile sig_atomic_t, as noted in an earlier answer).

I would like to quote Herb Sutter's words from his GotW #95, which can help to understand the meaning of the volatile variables:
C++ volatile variables (which have no analog in languages like C# and Java) are always beyond the scope of this and any other article about the memory model and synchronization. That’s because C++ volatile variables aren’t about threads or communication at all and don’t interact with those things. Rather, a C++ volatile variable should be viewed as portal into a different universe beyond the language — a memory location that by definition does not obey the language’s memory model because that memory location is accessed by hardware (e.g., written to by a daughter card), have more than one address, or is otherwise “strange” and beyond the language. So C++ volatile variables are universally an exception to every guideline about synchronization because are always inherently “racy” and unsynchronizable using the normal tools (mutexes, atomics, etc.) and more generally exist outside all normal of the language and compiler including that they generally cannot be optimized by the compiler (because the compiler isn’t allowed to know their semantics; a volatile int vi; may not behave anything like a normal int, and you can’t even assume that code like vi = 5; int read_back = vi; is guaranteed to result in read_back == 5, or that code like int i = vi; int j = vi; that reads vi twice will result in i == j which will not be true if vi is a hardware counter for example).

Related

Why isn't [[carries_dependency]] the default in C++?

I know that memory_order_consume has been deprecated, but I'm trying to understand the logic that went into the original design and how [[carries_dependency]] and kill_dependency were supposed to work. For that, I would like a specific example of code that would break on an IBM PowerPC or DEC alpha or even a hypothetical architecture with a hypothetical compiler that fully implemented consume semantics in C++11 or C++14.
The best I can come up with is an example like this:
#include <atomic>
#include <cassert>

int v;
std::atomic<int*> ap;

void
thread_1()
{
    v = 1;
    ap.store(&v, std::memory_order_release);
}

int
f(int *p [[carries_dependency]])
{
    return v;
}

void
thread_2()
{
    int *p;
    while (!(p = ap.load(std::memory_order_consume)))
        ;
    int v2 = f(p);
    assert(*p == v2);
}
I understand that the assertion could fail in this code. However, is it the case that the assertion is not supposed to fail if you remove [[carries_dependency]] from f? If so, why is that the case? After all, you requested a memory_order_consume, so why would you expect other accesses to v to reflect acquire semantics? If removing [[carries_dependency]] does not make the code correct, then what's an example where [[carries_dependency]] (or making [[carries_dependency]] the default for all variables) breaks otherwise correct code?
The only thing I can think is that maybe this has to do with register spills? If a function spills a register onto the stack and later re-loads it, this could break the dependency chain. So maybe [[carries_dependency]] makes things efficient in some cases (says no need to issue memory barrier in the caller before calling this function) but also requires the callee to issue a memory barrier before any register spills or calling another function, which could be less efficient in other cases? I'm grasping at straws here, though, so would still love to hear from someone who understands this stuff...
return v doesn't have a data dependency on int *p, so you'd need acquire not consume for ap.load(consume) / f(p) to synchronize with the release store.
If you'd used return *p then this would be sufficient thanks to dependency ordering, because that load would have a data dependency on the earlier load, no way for the CPU to generate the address earlier and thus load from v before the load from ap that saw the value you were waiting for.
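For illustration, a dependency-carrying version of f would look like this (a sketch; the only change is that the returned value is loaded through p):

int f(int *p [[carries_dependency]])
{
    // This load is data-dependent on the pointer obtained from
    // ap.load(std::memory_order_consume) in the caller, so dependency
    // ordering applies to it.
    return *p;
}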
Dropping the dependency-ordering stuff effectively requires promoting consume to acquire by using a memory barrier before the function call.
DEC Alpha would always need a barrier even for consume to work, i.e. it had to promote consume to acquire because the ISA didn't guarantee dependency ordering by the hardware.
Some ISAs (mostly just x86) are so strongly ordered they don't need a barrier because every load is an acquire load, not reordered with other loads. Or at least giving the illusion of not being reordered; actual implementations speculatively load early but nuke the pipeline if mis-speculation is detected, i.e. where a cache line isn't still valid by the time the load was architecturally allowed to happen.
So x86 and Alpha would likely still work even with the [[carries_dependency]] version, because they're either too strong or too weak for mo_consume to be something the hardware can actually do (more cheaply than mo_acquire).
For Alpha it would depend where the compiler put the barrier; it could put it after f(p)'s return v, only before the *p that actually depends on the consume load. Or it could just promote consume to acquire on the spot at the load, like compilers do now (since consume is deprecated after proving too hard to support in its current design.)
As for why ISO C++11 decided to promote consume results to effectively acquire when passing across function boundaries, that might have been a usability consideration. But also performance. Without that, compilers would lose the ability to do some optimizations on incoming function args.
e.g. int ready = foo.load(consume); / if(ready == 1) return non_atomic[ready-ready]; is required to make asm that has a data dependency on the consume-load result, unlike normal when the compiler would just optimize it to return *non_atomic.
(You might be familiar with x86 xor eax,eax being a good way to zero a register. In weakly-ordered ISAs that guarantee dependency ordering, like ARM, eor r0, r0,r0 is guaranteed not to break the dependency on the old value of r0.)
Also constant-propagation in branches like if(ready == 1) should be possible; code inside that if can only run when ready has the constant value 1. So even if we weren't cancelling it out, non_atomic[ready] can't be optimized to non_atomic[1];.
If every incoming function arg potentially carried a dependency, compilers would not be able to do those usual optimizations on any function args or values derived from them.
Related re: what consume is about and/or its deprecation, and that it's still used in a hand-rolled way with volatile in Linux kernel code (e.g. RCU):
[[carries_dependency]] what it means and how to implement
When should you not use [[carries_dependency]]?

Should volatile still be used for sharing data with ISRs in modern C++?

I've seen some flavors of these question around and I've seen mixed answers, still unsure whether they are up-to-date and fully apply to my use case, so I'll ask here. Do let me know if it's a duplicate!
Given that I'm developing for STM32 microcontrollers (bare-metal) using C++17 and the gcc-arm-none-eabi-9 toolchain:
Do I still need to use volatile for sharing data between an ISR and main()?
#include <cstdint>

volatile std::int32_t flag = 0;

extern "C" void ISR()
{
    flag = 1;
}

int main()
{
    while (!flag) { ... }
}
It's clear to me that I should always use volatile for accessing memory-mapped HW registers.
However for the ISR use case I don't know if it can be considered a case of "multithreading" or not. In that case, people recommend using C++11's new threading features (e.g. std::atomic). I'm aware of the difference between volatile (don't optimize) and atomic (safe access), so the answers suggesting std::atomic confuse me here.
For the case of "real" multithreading on x86 systems I haven't seen the need to use volatile.
In other words: can the compiler know that flag can change inside ISR? If not, how can it know it in regular multithreaded applications?
Thanks!
I think that in this case both volatile and atomic will most likely work in practice on the 32 bit ARM. At least in an older version of STM32 tools I saw that in fact the C atomics were implemented using volatile for small types.
Volatile will work because the compiler may not optimize away any access to the variable that appears in the code.
However, the generated code must differ for types that cannot be loaded in a single instruction. If you use a volatile int64_t, the compiler will happily load it in two separate instructions. If the ISR runs between loading the two halves of the variable, you will load half the old value and half the new value.
Unfortunately using atomic<int64_t> may also fail with interrupt service routines if the implementation is not lock free. For Cortex-M, 64-bit accesses are not necessarily lockfree, so atomic should not be relied on without checking the implementation. Depending on the implementation, the system might deadlock if the locking mechanism is not reentrant and the interrupt happens while the lock is held. Since C++17, this can be queried by checking atomic<T>::is_always_lock_free. A specific answer for a specific atomic variable (this may depend on alignment) may be obtained by checking flagA.is_lock_free() since C++11.
So longer data must be protected by a separate mechanism (for example by turning off interrupts around the access and making the variable atomic or volatile).
So the correct way is to use std::atomic, as long as the access is lock free. If you are concerned about performance, it may pay off to select the appropriate memory order and stick to values that can be loaded in a single instruction.
Not using either would be wrong; the compiler will check the flag only once.
These functions all wait for a flag, but they get translated differently:
#include <atomic>
#include <cstdint>
using FlagT = std::int32_t;
volatile FlagT flag = 0;
void waitV()
{
while (!flag) {}
}
std::atomic<FlagT> flagA;
void waitA()
{
while(!flagA) {}
}
void waitRelaxed()
{
while(!flagA.load(std::memory_order_relaxed)) {}
}
FlagT wrongFlag;
void waitWrong()
{
while(!wrongFlag) {}
}
Using volatile you get a loop that reexamines the flag as you wanted:
waitV():
ldr r2, .L5
.L2:
ldr r3, [r2]
cmp r3, #0
beq .L2
bx lr
.L5:
.word .LANCHOR0
Atomic with the default sequentially consistent access produces synchronized access:
waitA():
push {r4, lr}
.L8:
bl __sync_synchronize
ldr r3, .L11
ldr r4, [r3, #4]
bl __sync_synchronize
cmp r4, #0
beq .L8
pop {r4}
pop {r0}
bx r0
.L11:
.word .LANCHOR0
If you do not care about the memory order you get a working loop just as with volatile:
waitRelaxed():
ldr r2, .L17
.L14:
ldr r3, [r2, #4]
cmp r3, #0
beq .L14
bx lr
.L17:
.word .LANCHOR0
Using neither volatile nor atomic will bite you with optimization enabled, as the flag is only checked once:
waitWrong():
ldr r3, .L24
ldr r3, [r3, #8]
cmp r3, #0
bne .L23
.L22: // infinite loop!
b .L22
.L23:
bx lr
.L24:
.word .LANCHOR0
flag:
flagA:
wrongFlag:
To understand the issue, you must first understand why volatile is needed in the first place.
There are three completely separate issues here:
Incorrect optimizations because the compiler doesn't realize that hardware callbacks such as ISRs are actually called.
Solution: volatile or compiler awareness.
Re-entrancy and race condition bugs caused by accessing a variable in several instructions and getting interrupted in the middle of it by an ISR using the same variable.
Solution: protected or atomic access with mutex, _Atomic, disabled interrupts etc.
Parallelism or pre-fetch cache bugs caused by instruction re-ordering, multi-core execution, branch prediction.
Solution: memory barriers or allocation/execution in memory areas that aren't cached. volatile access may or may not act as a memory barrier on some systems.
As soon as someone brings this kind of question up on SO, you always get lots of PC programmers babbling about 2 and 3 without knowing or understanding anything about 1. This is because they have never in their life written an ISR, and PC compilers with multi-threading are generally aware that thread callbacks will get executed, so this isn't typically an issue in PC programs.
What you need to do to solve 1) in your case, is to see if the compiler actually generates code for reading while (!flag), with or without optimizations enabled. Disassemble and check.
Ideally, compiler documentation will tell that the compiler understands the meaning of some compiler-specific extension such as the non-standard keyword interrupt and upon spotting it make no assumptions about that function not getting called.
Sadly though, most compilers only use interrupt etc keywords to generate the right calling convention and return instructions. I recently encountered the missing volatile bug just a few weeks ago, upon helping someone on a SE site and they were using a modern ARM tool chain. So I don't trust compilers to handle this still, in the year 2020, unless they explicitly document it. When in doubt use volatile.
Regarding 2) and re-entrancy, modern compilers do support _Atomic nowadays, which makes things very easy. Use it if it's available and reliable on your compiler. Otherwise, for most bare metal systems you can utilize the fact that interrupts are non-interruptible and use a plain bool as a "mutex lite" (example), as long as there is no instruction re-ordering (an unlikely case for most MCUs).
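A rough sketch of that "mutex lite" idea (my own illustration, assuming a single-core MCU where the ISR itself cannot be interrupted and no instruction re-ordering occurs; all names are made up):

volatile bool main_is_busy = false;   // the "mutex lite", addressing issue 2 (re-entrancy)
volatile int  rx_counter;             // volatile on the data addresses issue 1 (optimization)

void uart_isr()                       // hypothetical interrupt handler
{
    if (main_is_busy)
        return;                       // back off rather than corrupt the data mid-access
    ++rx_counter;
}

void poll_from_main()
{
    main_is_busy = true;              // "lock": the ISR will skip its access
    int snapshot = rx_counter;        // multi-instruction access is now safe
    rx_counter = 0;
    main_is_busy = false;             // "unlock"
    (void)snapshot;                   // ...process snapshot...
}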
But please note that 2) is a separate issue not related to volatile. volatile does not solve thread-safe access. Thread-safe access does not solve incorrect optimizations. So don't mix these two unrelated concepts up in the same mess, as often seen on SO.
Of the commercial compilers I've tested that weren't based on gcc or clang, all of them would treat a read or write via volatile pointer or lvalue as being capable of accessing any other object, without regard for whether it would seem possible for the pointer or lvalue to hit the object in question. Some, such as MSVC, formally documented the fact that volatile writes have release semantics and volatile reads have acquire semantics, while others would require a read/write pair to achieve acquire semantics.
Such semantics make it possible to use volatile objects to build a mutex that can guard "ordinary" objects on systems with a strong memory model (including single-core systems with interrupts), or on compilers that apply acquire/release barriers at the hardware memory ordering level rather than merely the compiler ordering level.
Neither clang nor gcc, however, offers any option other than -O0 which would offer such semantics, since they would impede "optimizations" that would otherwise be able to convert code that performs seemingly-redundant loads and stores [that are actually needed for correct operation] into "more efficient" code [that doesn't work]. To make one's code usable with those, I would recommend defining a 'memory clobber' macro (which for clang or gcc would be asm volatile ("" ::: "memory");) and invoking it between the action which needs to precede a volatile write and the write itself, or between a volatile read and the first action which would need to follow it. If one does that, that would allow one's code to be readily adapted to implementations that would neither support nor require such barriers, simply by defining the macro as an empty expansion.
Note that while some compilers interpret all asm directives as a memory clobber, and there wouldn't be any other purpose for an empty asm directive, gcc simply ignores empty asm directives rather than interpreting them in such fashion.
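A sketch of such a macro, following the description above (the gcc/clang form is the one quoted in the previous paragraph; the empty fallback is for implementations that neither support nor require the barrier):

/* Compiler-ordering barrier only; it emits no instructions and no hardware fence. */
#if defined(__GNUC__) || defined(__clang__)
  #define MEMORY_CLOBBER() asm volatile ("" ::: "memory")
#else
  #define MEMORY_CLOBBER() ((void)0)   /* empty expansion where not needed */
#endif

/* Usage pattern described above (illustrative):
       ...prepare the data the other side will consume...
       MEMORY_CLOBBER();     between the preparation and the volatile write
       data_ready = 1;       the volatile write that publishes the data      */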
An example of a situation where gcc's optimizations would prove problematic (clang seems to handle this particular case correctly, but some others still pose problems):
short buffer[10];
short volatile *volatile tx_ptr;
volatile int tx_count;

void test(void)
{
    buffer[0] = 1;
    tx_ptr = buffer;
    tx_count = 1;
    while(tx_count)
        ;
    buffer[0] = 2;
    tx_ptr = buffer;
    tx_count = 1;
    while(tx_count)
        ;
}
GCC will decide to optimize out the assignment buffer[0]=1; because the Standard doesn't require it to recognize that storing the buffer's address into a volatile might have side effects that would interact with the value stored there.
[edit: further experimentation shows that icc will reorder accesses to volatile objects, but since it reorders them even with respect to each other, I'm not sure what to make of that, since that would seem broken by any imaginable interpretation of the Standard].
Short answer: always use std::atomic<T> whose is_lock_free() returns true.
Reasoning:
volatile can work reliably on simple architectures (single-core, no cache, ARM/Cortex-M) like STM32F2 or ATSAMG55 with e.g. IAR compiler. But...
It may fail to work as expected on more complex architectures (multi-core with cache) and when compiler tries to do certain optimisations (many examples in other answers, won't repeat that).
atomic_flag and atomic_int (if is_lock_free(), which they should be) are safe to use anywhere, because they work like volatile with added memory barriers / synchronization when needed (avoiding the problems in the previous point).
The reason I specifically said you should only use those with is_lock_free() being true is that you cannot stop an IRQ the way you could stop a thread. An IRQ interrupts the main loop and does its job; it cannot wait-lock on a mutex because it would be blocking the very main loop it is waiting for.
Practical note: I personally either use atomic_flag (the one and only type guaranteed to work) to implement a sort of spin-lock, where the ISR will disable itself when finding the lock locked, while the main loop will always re-enable the ISR after unlocking; or I use a double-counter lock-free queue (SPSC - single producer, single consumer) using that atomic_int. (Have one reader-counter and one writer-counter, subtract to find the real count. Good for UART etc.)
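A condensed sketch of that double-counter SPSC idea (my own illustration; the names, element type and queue size are made up):

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 64;                       // queue size, a power of two
static std::uint8_t ring[N];
static std::atomic<std::uint32_t> writeCount{0};    // bumped only by the producer
static std::atomic<std::uint32_t> readCount{0};     // bumped only by the consumer

// Producer side (e.g. the UART ISR): returns false if the queue is full.
bool push(std::uint8_t byte)
{
    std::uint32_t w = writeCount.load(std::memory_order_relaxed);
    if (w - readCount.load(std::memory_order_acquire) >= N)  // subtract to find the real count
        return false;
    ring[w % N] = byte;
    writeCount.store(w + 1, std::memory_order_release);
    return true;
}

// Consumer side (e.g. the main loop): returns false if the queue is empty.
bool pop(std::uint8_t& byte)
{
    std::uint32_t r = readCount.load(std::memory_order_relaxed);
    if (writeCount.load(std::memory_order_acquire) == r)
        return false;
    byte = ring[r % N];
    readCount.store(r + 1, std::memory_order_release);
    return true;
}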

Function not called in code gets called at runtime

How can the following program be calling format_disk if it's never
called in code?
#include <cstdio>
static void format_disk()
{
std::puts("formatting hard disk drive!");
}
static void (*foo)() = nullptr;
void never_called()
{
foo = format_disk;
}
int main()
{
foo();
}
This differs from compiler to compiler. Compiling with Clang with
optimizations on, the function never_called executes at runtime.
$ clang++ -std=c++17 -O3 a.cpp && ./a.out
formatting hard disk drive!
Compiling with GCC, however, this code just crashes:
$ g++ -std=c++17 -O3 a.cpp && ./a.out
Segmentation fault (core dumped)
Compilers version:
$ clang --version
clang version 5.0.0 (tags/RELEASE_500/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ gcc --version
gcc (GCC) 7.2.1 20171128
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The program contains undefined behavior, as dereferencing a null pointer
(i.e. calling foo() in main without assigning a valid address to it
beforehand) is UB, therefore no requirements are imposed by the standard.
Executing format_disk at runtime is a perfectly valid situation when
undefined behavior has been hit, it's as valid as just crashing (like
when compiled with GCC). Okay, but why is Clang doing that? If you
compile it with optimizations off, the program will no longer output
"formatting hard disk drive", and will just crash:
$ clang++ -std=c++17 -O0 a.cpp && ./a.out
Segmentation fault (core dumped)
The generated code for this version is as follows:
main: # #main
push rbp
mov rbp, rsp
call qword ptr [foo]
xor eax, eax
pop rbp
ret
It tries to make a call to a function to which foo points, and as foo
is initialized with nullptr (or if it didn't have any initialization,
this would still be the case), its value is zero. Here, undefined
behavior has been hit, so anything can happen at all and the program
is rendered useless. Normally, making a call to such an invalid address
results in segmentation fault errors, hence the message we get when
executing the program.
Now let's examine the same program but compiling it with optimizations on:
$ clang++ -std=c++17 -O3 a.cpp && ./a.out
formatting hard disk drive!
The generated code for this version is as follows:
never_called(): # #never_called()
ret
main: # #main
push rax
mov edi, .L.str
call puts
xor eax, eax
pop rcx
ret
.L.str:
.asciz "formatting hard disk drive!"
Interestingly, somehow optimizations modified the program so that
main calls std::puts directly. But why did Clang do that? And why is
never_called compiled to a single ret instruction?
Let's get back to the standard (N4660, specifically) for a moment. What
does it say about undefined behavior?
3.27 undefined behavior [defns.undefined]
behavior for which this document imposes no requirements
[Note: Undefined behavior may be expected when this document omits
any explicit definition of behavior or when a program uses an erroneous
construct or erroneous data. Permissible undefined behavior ranges
from ignoring the situation completely with unpredictable results, to
behaving during translation or program execution in a documented manner
characteristic of the environment (with or without the issuance of a
diagnostic message), to terminating a translation or execution (with the
issuance of a diagnostic message). Many erroneous program constructs
do not engender undefined behavior; they are required to be diagnosed.
Evaluation of a constant expression never exhibits behavior explicitly
specified as undefined ([expr.const]). — end note]
Emphasis mine.
A program that exhibits undefined behavior becomes useless, as everything
it has done so far and will do further has no meaning if it contains
erroneous data or constructs. With that in mind, do remember that
compilers may completely ignore the case when undefined behavior
is hit; this is actually used to derive facts when optimizing a
program. For instance, a construct like x + 1 > x (where x is a signed integer) will be optimized away to a constant,
true, even if the value of x is unknown at compile-time. The reasoning
is that the compiler wants to optimize for valid cases, and the only
way for that construct to be valid is when it doesn't trigger arithmetic
overflow (i.e. if x != std::numeric_limits<decltype(x)>::max()). This
is a new learned fact in the optimizer. Based on that, the construct is
proven to always evaluate to true.
Note: this same optimization can't occur for unsigned integers, because overflowing one is not UB. That is, the compiler needs to keep the expression as it is, as it might have a different evaluation when overflow occurs (unsigned arithmetic is modulo 2^N, where N is the number of bits). Optimizing it away for unsigned integers would not comply with the standard (thanks aschepler).
This is useful as it allows for tons of optimizations to kick
in. So
far, so good, but what happens if x holds its maximum value at runtime?
Well, that is undefined behavior, so it's nonsense to try to reason about
it, as anything may happen and the standard imposes no requirements.
Now we have enough information in order to better examine your faulty
program. We already know that accessing a null pointer is undefined
behavior, and that's what's causing the funny behavior at runtime.
So let's try and understand why Clang (or technically LLVM) optimized
the program the way it did.
static void (*foo)() = nullptr;
static void format_disk()
{
std::puts("formatting hard disk drive!");
}
void never_called()
{
foo = format_disk;
}
int main()
{
foo();
}
Remember that it's possible to call never_called before the main entry
starts executing. For example, when declaring a top-level variable,
you can call it while initializing the value of that variable:
void never_called();
int x = (never_called(), 42);
If you write this snippet in your program, the program no
longer exhibits undefined behavior, and the message "formatting hard
disk drive!" is displayed, with optimizations either on or off.
So what's the only way this program is valid? There's this never_called
function that assigns the address of format_disk to foo, so we might
find something here. Note that foo is marked as static, which means it
has internal linkage and can't be accessed from outside this translation
unit. In contrast, the function never_called has external linkage, and may
be accessed from outside. If another translation unit contains a snippet
like the one above, then this program becomes valid.
Cool, but there's no one calling never_called from outside. Even though this
is the case, the optimizer sees that the only way for this program to
be valid is if never_called is called before main executes, otherwise it's
just undefined behavior. That's a new learned fact, so the compiler assumes never_called
is in fact called. Based on that new knowledge, other optimizations that
kick in may take advantage of it.
For instance, when constant
folding is
applied, it sees that the construct foo() is only valid if foo can be properly initialized. The only way for that to happen is if never_called is called outside of this translation unit, so foo = format_disk.
Dead code elimination and interprocedural optimization might find out that if foo == format_disk, then the code inside never_called is unneeded,
so the function's body is transformed into a single ret instruction.
Inline expansion optimization
sees that foo == format_disk, so the call to foo can be replaced
with its body. In the end, we end up with something like this:
never_called():
ret
main:
mov edi, .L.str
call puts
xor eax, eax
ret
.L.str:
.asciz "formatting hard disk drive!"
Which is somewhat equivalent to the output of Clang with optimizations on. Of course, what Clang really did may be different, but the optimizations are nonetheless capable of reaching the same conclusion.
Examining GCC's output with optimizations on, it seems it didn't bother investigating:
.LC0:
.string "formatting hard disk drive!"
format_disk():
mov edi, OFFSET FLAT:.LC0
jmp puts
never_called():
mov QWORD PTR foo[rip], OFFSET FLAT:format_disk()
ret
main:
sub rsp, 8
call [QWORD PTR foo[rip]]
xor eax, eax
add rsp, 8
ret
Executing that program results in a crash (segmentation fault), but if you call never_called in another translation unit before main gets executed, then this program doesn't exhibit undefined behavior anymore.
All of this can change crazily as more and more optimizations are engineered, so do not rely on the assumption that your compiler will take care of code containing undefined behavior, it might just screw you up as well (and format your hard drive for real!)
I recommend you read What every C programmer should know about Undefined Behavior and A Guide to Undefined Behavior in C and C++, both article series are very informative and might help you out with understanding the state of art.
Unless an implementation specifies the effect of trying to invoke a null function pointer, it could behave as a call to arbitrary code. Such arbitrary code could perfectly well behave like a call to function "foo()". While Annex L of the C Standard would invite implementations to distinguish between "Critical UB" and "non-critical UB", and some C++ implementations might apply a similar distinction, invoking an invalid function pointer would be critical UB in any case.
Note that the situation in this question is very different from e.g.
unsigned short q;
unsigned hey(void)
{
if (q < 50000)
do_something();
return q*q;
}
In the latter situation, a compiler which does not claim to be "analyzable" might recognize that the code will invoke undefined behavior if q is greater than 46,340 when execution reaches the return statement, and thus it might as well invoke do_something() unconditionally. While Annex L is badly written, it would seem the intention would be to forbid such "optimizations". In the case of calling an invalid function pointer, however, even straightforwardly-generated code on most platforms might have arbitrary behavior.

How to replace alloca in an implementation of execvp()?

Take a look at the NetBSD implementation of execvp here:
http://cvsweb.netbsd.se/cgi-bin/bsdweb.cgi/src/lib/libc/gen/execvp.c?rev=1.30.16.2;content-type=text%2Fplain
Note the comment at line 130, in the special case for handling ENOEXEC:
/*
* we can't use malloc here because, if we are doing
* vfork+exec, it leaks memory in the parent.
*/
if ((memp = alloca((cnt + 2) * sizeof(*memp))) == NULL)
goto done;
memp[0] = _PATH_BSHELL;
memp[1] = bp;
(void)memcpy(&memp[2], &argv[1], cnt * sizeof(*memp));
(void)execve(_PATH_BSHELL, __UNCONST(memp), environ);
goto done;
I am trying to port this implementation of execvp to standalone C++. alloca is nonstandard so I want to avoid it. (Actually the function I want is execvpe from FreeBSD, but this demonstrates the problem more clearly.)
I think I understand why it would leak memory if plain malloc was used - while the caller of execvp can execute code in the parent, the inner call to execve never returns so the function cannot free the memp pointer, and there's no way to get the pointer back to the caller. However, I can't think of a way to replace alloca - it seems to be necessary magic to avoid this memory leak. I have heard that C99 provides variable length arrays, which I cannot use sadly as the eventual target is C++.
Is it possible to replace this use of alloca? If it's mandated to stay within C++/POSIX, is there an inevitable memory leak when using this algorithm?
Edit: As Michael has pointed out in the comments, what is written below really won't work in the real-world due to stack-relative addressing by an optimizing compiler. Therefore a production-level alloca needs the help of the compiler to actually "work". But hopefully the code below could give some ideas about what's happening under the hood, and how a function like alloca might have worked if there were no stack-relative addressing optimizations to worry about.
BTW, just in case you were stil curious about how you could make a simple version of alloca for yourself, since that function basically returns a pointer to allocated space on the stack, you can write a function in assembly that can properly manipulate the stack, and return a pointer you can use in the current scope of the caller (once the caller returns, the stack space pointer from this version of alloca is invalidated since the return from the caller cleans up the stack).
Assuming you're using some flavor of Linux on a x86_64 platform using the Unix 64-bit ABI, place the following inside a file called "my_alloca.s":
.section .text
.global my_alloca
my_alloca:
movq (%rsp), %r11 # save the return address in temp register
subq %rdi, %rsp # allocate space on stack from first argument
movq $0x10, %rax
negq %rax
andq %rax, %rsp # align the stack to 16-byte boundary
movq %rsp, %rax # save address in return register
pushq %r11 # push return address on stack
ret # return back to caller
Then inside your C/C++ code module (i.e, your ".cpp" files), you can use it the following way:
extern "C" void* my_alloca(unsigned long size);   // returns the pointer placed in %rax

void function()
{
    void* stack_allocation = my_alloca(BUFFERSIZE);
    //...do something with the allocated space
    return; //WARNING: stack_allocation will be invalid after return
}
You can compile "my_alloca.s" using gcc -c my_alloca.s. This will give you a file named "my_alloca.o" that you can then use to link with your other object files using gcc -o or using ld.
The main "gotcha" that I could think of with this implementation is that you could crash or end up with undefined behavior if the compiler did not work by allocating space on the stack using an activation record and a stack base-pointer (i.e., the RBP pointer in x86_64), but rather explicitly allocated memory for each function call. Then, since the compiler won't be aware of the memory we've allocated on the stack, when it cleans up the stack at the return of the caller and tries to jump back using what it believes is the caller's return address that was pushed on the stack at the beginning of the function call, it will jump to an instruction pointer that's pointing to no-wheres-ville and you'll most likely crash with a bus error or some type of access error since you'll be trying to execute code in a memory location you're not allowed to.
There's actually other dangerous things that could happen, such as if the compiler used stack-space to allocate the arguments (it shouldn't for this function per the Unix 64-bit ABI since there's only a single argument), as that would again cause a stack clean-up right after the function call, messing up the validity of the pointer. But with a function like execvp(), which won't return unless there's an error, this shouldn't be so much of an issue.
All-in-all, a function like this will be platform-dependent.
You can replace the call to alloca with a call to malloc made before the call to vfork. After the vfork returns in the caller the memory can be deleted. (This is safe because vfork will not return until exec has been called and the new program started.) The caller can then free the memory it allocated with malloc.
This doesn't leak memory in the child because the exec call completely replaces the child's image with that of the new program, implicitly releasing the memory that the forked process was holding.
Another possible solution is to switch to fork instead of vfork. This will require a little extra code in the caller because fork returns before the exec call is complete, so the caller will need to wait for it. But once forked, the new process can use malloc safely. My understanding of vfork is that it was basically a poor man's fork, because fork was expensive in the days before kernels had copy-on-write pages. Modern kernels implement fork very efficiently and there's no need to resort to the somewhat dangerous vfork.
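A rough sketch of the malloc-before-vfork idea (error handling omitted; the names and the meaning of cnt are illustrative, this is not NetBSD's code):

#include <cstdlib>
#include <cstring>
#include <unistd.h>

extern char **environ;

// cnt = number of entries in argv[1..], including the terminating NULL
pid_t spawn_shell(const char *script, char *const argv[], unsigned long cnt)
{
    // Allocate in the parent, before vfork(), so the child never owns it.
    const char **memp =
        static_cast<const char **>(std::malloc((cnt + 2) * sizeof(*memp)));
    memp[0] = "/bin/sh";
    memp[1] = script;
    std::memcpy(&memp[2], &argv[1], cnt * sizeof(*memp));

    pid_t pid = vfork();
    if (pid == 0) {                                         // child
        execve("/bin/sh", const_cast<char *const *>(memp), environ);
        _exit(127);                                         // only reached if execve failed
    }
    std::free(memp);   // parent: safe, vfork only returns once exec has run
    return pid;
}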

What kind of operations might one need to do before main()

I came across this question asking how to execute code before main() in C, mentioning there were strategies for C++. I've mostly lived in application space, so executing before main() has never occurred to me. What kind of things require this technique?
"What kind of things require this technique?"
Point of fact: none.
However, there are a lot of useful things you might WANT to do before main for a variety of reasons. For just one practical example, say you have an abstract factory that builds doohickies. You could make sure to build the factory instance, assign it to some special area, and then register the various concrete doohickies to it...yes, you can do that.
On the other hand, if you implement the factory as a singleton and use the facts of global value initialization to "trick" the implementation into registering concrete doohickies before main starts you gain several benefits with very few costs (the fact of using singletons, basically a non-issue here, is pretty much the only one).
For example you:
Don't have to maintain a list of registrations that all must be explicitly called. In fact, you can even declare and define an entire class in private scope, out of sight of anyone, and have it available for use when the program starts.
main() doesn't have to do a bunch of crap with a bunch of objects it doesn't care about.
So, none of this is actually necessary. However, you can reduce coupling and maintenance issues if you leverage the fact that globals are initialized before main begins.
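As a concrete illustration of that self-registration trick, here is a minimal sketch (the names and structure are mine, not from the original answer):

#include <functional>
#include <map>
#include <memory>
#include <string>

struct Doohickey { virtual ~Doohickey() = default; };

class DoohickeyFactory {
public:
    static DoohickeyFactory& instance() {          // Meyers singleton
        static DoohickeyFactory f;
        return f;
    }
    void registerMaker(const std::string& name,
                       std::function<std::unique_ptr<Doohickey>()> maker) {
        makers_[name] = std::move(maker);
    }
    std::unique_ptr<Doohickey> make(const std::string& name) const {
        return makers_.at(name)();
    }
private:
    std::map<std::string, std::function<std::unique_ptr<Doohickey>()>> makers_;
};

// In some .cpp file, entirely out of sight of the rest of the program:
namespace {
struct BlueDoohickey : Doohickey {};

// Dynamic initialization of this object performs the registration,
// typically before main() starts (see the caveat in the Edit below).
const bool registered = [] {
    DoohickeyFactory::instance().registerMaker("blue",
        []() -> std::unique_ptr<Doohickey> { return std::make_unique<BlueDoohickey>(); });
    return true;
}();
}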
Edit:
I should note here that I've since learned this isn't guaranteed by the language. C++ only guarantees that zero or constant initialization happens before main. What I talk about in this answer is dynamic initialization, which C++ guarantees happens before the first use of the variable, much like function-local static variables.
Every compiler though seems to do dynamic initialization before main. I thought I ran into one once that did not but I believe the source of the issue was something else.
This technique can be used for library initialization routines or for initializing data that will be used implicitly during the execution of the program.
GCC provides constructor and destructor function attributes that cause a function to be called automatically before execution enters main() or main() has completed or exit() has been called, respectively.
void __attribute__ ((constructor)) my_init(void);
void __attribute__ ((destructor)) my_fini(void);
In the case of library initialization, constructor routines are executed before dlopen() returns if the library is loaded at runtime or before main() is started if the library is loaded at load time. When used for library cleanup, destructor routines are executed before dlclose() returns if the library is loaded at runtime or after exit() or completion of main() if the library is loaded at load time.
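For instance, a small self-contained illustration of those attributes (GCC/Clang specific; the function bodies and messages are made up):

#include <cstdio>

__attribute__ ((constructor))
static void my_init(void)
{
    std::puts("my_init: runs before main()");
}

__attribute__ ((destructor))
static void my_fini(void)
{
    std::puts("my_fini: runs after main() returns or exit() is called");
}

int main()
{
    std::puts("main: running");
}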
The only things you could want to do before main involve global variables, which are bad, and the same things could always be accomplished by lazy initialization (initialization at the point of first use). Of course they'd be much better accomplished by not using global variables at all.
One possible "exception" is the initialization of global constant tables at runtime. But this is a very bad practice, as the tables are not sharable between instances of a library/process if you fill them at runtime. It's much smarter to write a script to generate the static const tables as a C or C++ source file at build time.
Stuff done before main:
On x86, the stack pointer register is usually masked (e.g. something like rsp &= ~0xF) to align it
Static members are initialized
push argc and argv (and environ if needed)
call _main =p
g++ 4.4 emits the following before any of my code is emitted. Technically it inserts it into the top of main before any of my code, but I've seen compilers that use _init instead of _main as the entry point:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
# My code follows
Anything that needs to run code to guarantee invariants for your code after main starts needs to run before main. Things like global iostreams, the C runtime library, OS bindings, etc.
Now whether you actually need to write code that does such things as well is what everyone else is answering.
If you have a library, it is very convenient to be able to initialise some data, create threads etc. before main() is invoked, and know that a desired state is achieved without burdening and trusting the client app to explicitly call some library initialisation and/or shutdown code. Superficially, this can be achieved by having a static object whose constructor and destructor performs the necessary operations. Unfortunately, multiple static objects in different translation units or libraries will have an undefined order of initialisation, so if they depend upon each other (worse yet, in a cyclic fashion), then they may still not have achieved their initialised state before a request comes in. Similarly, one static object may create threads and call services in another object that aren't yet threadsafe. So, a more structured approach with proper singleton instances and locks is needed for robustness in the face of arbitrary usage, and the whole thing looks much less appealing, though it may still be a net win in some cases.