I'm curious about the underlying implementation of static variables within a function.
If I declare a static variable of a fundamental type (char, int, double, etc.), and give it an initial value, I imagine that the compiler simply sets the value of that variable at the very beginning of the program before main() is called:
void SomeFunction();
int main(int argCount, char ** argList)
{
// at this point, the memory reserved for 'answer'
// already contains the value of 42
SomeFunction();
}
void SomeFunction()
{
static int answer = 42;
}
However, if the static variable is an instance of a class:
class MyClass
{
//...
};
void SomeFunction();
int main(int argCount, char ** argList)
{
SomeFunction();
}
void SomeFunction()
{
static MyClass myVar;
}
I know that it will not be initialized until the first time that the function is called. Since the compiler has no way of knowing when the function will be called for the first time, how does it produce this behavior? Does it essentially introduce an if-block into the function body?
static bool initialized = 0;
if (!initialized)
{
// construct myVar
initialized = 1;
}
This question covered similar ground, but thread safety wasn't mentioned. For what it's worth, C++0x (now C++11) makes function static initialisation thread safe.
(see the C++0x FCD, 6.7/4 on function statics: "If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.")
One other thing that hasn't been mentioned is that function statics are destructed in reverse order of their construction, so the compiler maintains a list of destructors to call on shutdown (this may or may not be the same list that atexit uses).
In the compiler output I have seen, function local static variables are initialized exactly as you imagine.
(Caveat: This paragraph applies to C++ versions older than C++11. See the comments for changes since C++11.) Note that in general this is not done in a thread-safe manner. So if you have functions with static locals like that that might be called from multiple threads, you should take this into account. Calling the function once in the main thread before any others are called will usually do the trick.
I should add that if the initialization of the local static is by a simple constant like in your example, the compiler doesn't need to go through these gyrations - it can just initialize the variable in the image or before main() like a regular static initialization (because your program wouldn't be able to tell the difference). But if you initialize it with a function's return value, then the compiler pretty much has to test a flag indicating if the initialization has been done or something equivalent.
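The run-time case can be observed from user code when the initializer has a side effect. The following is a minimal sketch of that; init_calls, compute_answer and get_answer are hypothetical names introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical counter: records how many times the initializer runs.
static int init_calls = 0;

// Hypothetical initializer with an observable side effect, so the compiler
// cannot fold the initialization into the program image.
static int compute_answer() {
    ++init_calls;
    return 42;
}

int get_answer() {
    // Dynamic initialization: runs the first time control passes through
    // this declaration, and never again on later calls.
    static int answer = compute_answer();
    return answer;
}
```

Before the first call to get_answer(), init_calls is still 0; after any number of calls it is 1, which is the behaviour the hidden flag (or something equivalent) has to provide.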
You're right about everything, including the initialized flag as a common implementation. This is basically why initialization of static locals is not thread-safe, and why pthread_once exists.
One slight caveat: the compiler must emit code which "behaves as if" the static local variable is constructed the first time it is used. Since integer initialization has no side effects (and calls no user code), it's up to the compiler when it initializes the int. User code cannot "legitimately" find out what it does.
Obviously you can look at the assembly code, or provoke undefined behaviour and make deductions from what actually happens. But the C++ standard doesn't count that as valid grounds to claim that the behaviour is not "as if" it did what the spec says.
I know that it will not be initialized until the first time that the function is called. Since the compiler has no way of knowing when the function will be called for the first time, how does it produce this behavior? Does it essentially introduce an if-block into the function body?
Yes, that's right: and, FWIW, it's not necessarily thread-safe (if the function is called "for the first time" by two threads simultaneously).
For that reason you might prefer to define the variable at global scope (although maybe in a class or namespace, or static without external linkage) instead of inside a function, so that it's initialized before the program starts without any run-time "if".
Another twist is in embedded code, where the run-before-main() code (cinit/whatever) may copy pre-initialized data (both statics and non-statics) into ram from a const data segment, perhaps residing in ROM. This is useful where the code may not be running from some sort of backing store (disk) where it can be re-loaded from. Again, this doesn't violate the requirements of the language, since this is done before main().
Slight tangent: While I've not seen it done much (outside of Emacs), a program or compiler could basically run your code in a process and instantiate/initialize objects, then freeze and dump the process. Emacs does something similar to this to load up large amounts of elisp (i.e. chew on it), then dump the running state as the working executable, to avoid the cost of parsing on each invocation.
The relevant thing isn't being a class type or not, it's compile-time evaluation of the initializer (at the current optimization level). And of course the constructor not having any side-effects, if it's non-trivial.
If it's not possible to simply put a constant value in .data, gcc/clang use an acquire load of a guard variable to check that static locals have been initialized. If the guard variable is false, then they pick one thread to do the initializing, and have other threads wait for it if they also see a false guard variable. They've been doing this for a long time, since before C++11 required it. (e.g. as old as GCC4.1 on Godbolt, from May 2006.)
Does a function local static variable automatically incur a branch? shows what GCC does.
Cost of thread-safe local static variable initialization in C++11? same
Why does initialization of local static objects use hidden guard flags? same
The most simple artificial example, snapshotting the arg from the first call and ignoring later args:
int foo(int a){
static int x = a;
return x;
}
Compiles for x86-64 with GCC11.3 -O3 (Godbolt), with the exact same asm generated for -std=gnu++03 mode. GCC4.1 also makes about the same asm, but doesn't keep the push/pop off the fast path (i.e. missing shrink-wrap optimization). GCC4.1 only supported AT&T syntax output, so it visually looks different unless you flip modern GCC to AT&T mode as well, but this is Intel syntax (destination on the left).
# demangled asm from g++ -O3
foo(int):
movzx eax, BYTE PTR guard variable for foo(int)::x[rip] # guard.load(acquire)
test al, al
je .L13
mov eax, DWORD PTR foo(int)::x[rip] # normal load of the static local
ret # fast path through the function is the already-initialized case
.L13: # jumps here on guard == 0, on the first call (and any that race with it)
# It would be sensible for GCC to put this code in .text.cold
push rbx
mov ebx, edi # save function arg in a call-preserved reg
mov edi, OFFSET FLAT:guard variable for foo(int)::x # address
call __cxa_guard_acquire # guard_acquire(&guard_x) presumably a normal mutex or spinlock
test eax, eax
jne .L14 # if (we won the race to do the init work) goto .L14
mov eax, DWORD PTR foo(int)::x[rip] # else it's done now by another thread
pop rbx
ret
.L14:
mov edi, OFFSET FLAT:guard variable for foo(int)::x
mov DWORD PTR foo(int)::x[rip], ebx # init static x (from a saved in RBX)
call __cxa_guard_release
mov eax, DWORD PTR foo(int)::x[rip] # missed optimization: mov eax, ebx
# This thread is the one that just initialized it, our function arg is the value.
# It's not atomic (or volatile), so another thread can't have set it, too.
pop rbx
ret
If compiling for AArch64, the load of the guard variable is ldarb w8, [x8], a load with acquire semantics. Other ISAs might need a plain load and then a barrier to give at least LoadLoad ordering, to make sure they load the payload x no earlier than when they saw the guard variable being non-zero.
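The guard logic in the asm above can be approximated in portable C++11. This is only a sketch of the idea, not what the runtime's __cxa_guard_acquire actually does internally: x_ready, x_init_mutex, x_storage, x_init_count and foo_by_hand are hypothetical names, and a std::mutex stands in for whatever lock the runtime uses.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hand-written stand-ins for the compiler-generated guard byte and for the
// lock behind __cxa_guard_acquire / __cxa_guard_release (assumptions, see above).
static std::atomic<bool> x_ready{false};
static std::mutex x_init_mutex;
static int x_storage;         // plays the role of foo(int)::x
static int x_init_count = 0;  // instrumentation only, written under the lock

int foo_by_hand(int a) {
    // Fast path: one acquire load of the guard, then a plain load of x.
    if (!x_ready.load(std::memory_order_acquire)) {
        std::lock_guard<std::mutex> lock(x_init_mutex);
        // Re-check under the lock: another thread may have won the race.
        if (!x_ready.load(std::memory_order_relaxed)) {
            x_storage = a;  // the actual initialization
            ++x_init_count;
            // Release store publishes x_storage to later acquire loads.
            x_ready.store(true, std::memory_order_release);
        }
    }
    return x_storage;
}
```

As with the real foo, the first call snapshots its argument and later calls ignore theirs: foo_by_hand(5) followed by foo_by_hand(9) returns 5 both times.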
If the static variable has a constant initializer, no guard is needed:
int bar(int a){
static int x = 1;
return ++x + a;
}
bar(int):
mov eax, DWORD PTR bar(int)::x[rip]
add eax, 1
mov DWORD PTR bar(int)::x[rip], eax # store the updated value
add eax, edi # and add it to the function arg
ret
.section .data
bar(int)::x:
.long 1
I'm learning about using inline assembly inside the C++ code.
Here is the very simple example:
// Power2_inline_asm.c
// compile with: /EHsc
// processor: x86
#include <stdio.h>
int power2( int num, int power );
int main( void )
{
printf_s( "3 times 2 to the power of 5 is %d\n", \
power2( 3, 5) );
}
int power2( int num, int power )
{
__asm
{
mov eax, num ; Get first argument
mov ecx, power ; Get second argument
shl eax, cl ; EAX = EAX * ( 2 to the power of CL )
}
// Return with result in EAX
}
Since the power2 function returns the result, why isn't there a ret instruction at the end of the asm code?
Or a C++ return keyword outside the asm block, before the end of the function?
EAX is implied to contain the return value, and the ret is generated by the compiler (the compiler generates some code unless __declspec(naked) is specified). Since there's no C++ return statement, the behavior is undefined from the C++ point of view; here the undefined behavior manifests as returning whatever EAX contains, which is the result.
It seems you're unclear about the relationship between the ret instruction and return values. There is none.
The operand to the ret instruction is not the return value, it's the number of bytes to remove from the stack for calling conventions where the callee handles argument cleanup.
The return value is passed in some other way, controlled by the calling convention, and must be stored before reaching the ret instruction.
Not having a ret instruction in the asm is totally normal; you want to let the compiler generate function prologues / epilogues, including a ret instruction, like they normally do for paths of execution that reach the } in a function. (Or you'd use __declspec(naked) to write the whole function in asm, including handling the calling convention, which would let you use fastcall to take args in registers instead of needing to load them from stack memory).
The more interesting thing is falling off the end of a non-void function without a return. (I edited your question to ask about that, too).
In ISO C++, that's undefined behaviour. (So compilers like clang -fasm-blocks can assume that path of execution is never reached, and not even emit any instructions for it, not even a ret.) But MSVC does at least de-facto define the behaviour of doing that.
MSVC specifically does support falling off the end of a non-void function after an asm statement, treating EAX or ST0 as the return value. Or at least that's how MSVC de-facto works, whether it's intentional support or not; it even supports inlining such functions, so it's not just a calling-convention abuse of UB. (clang -fasm-blocks does not work that way; IDK about clang-cl. clang does not define the behaviour of falling off the end of a non-void function, and fully omits the ret because that path of execution must not be reachable.)
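For comparison, here is a portable sketch of the same computation that hands the result back with an explicit return, so nothing depends on what is left in EAX. power2_portable is a hypothetical name, and the stated range restrictions are assumptions added to keep the shift well-defined.

```cpp
#include <cassert>

// Portable sketch: the value is returned explicitly, so no path of
// execution falls off the end of a non-void function.
int power2_portable(int num, int power) {
    // Same arithmetic as the asm block: num * (2 to the power of 'power').
    // Assumes num >= 0, 0 <= power < 31, and a result that fits in int.
    return num << power;
}
```

Calling power2_portable(3, 5) gives 96, the same value the MSVC example prints.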
Not using ret in the asm{} block
ESP isn't pointing at the return address when the asm{} block executes; I think MSVC always forces functions using asm{} to set up EBP as a frame pointer.
You definitely can't just ret out of the middle of a function without giving the compiler a chance to restore call-preserved registers and clean up the stack in the function epilogue.
Also, what if the compiler had inlined power2 into a caller?
Then you'd be returning from that caller (if you did leave / ret in an asm block).
Look at the compiler-generated asm.
(TODO: I was going to write more and link something on https://godbolt.org/, but never got back to it.)
I'm wondering if there could be a problem with putting normal_distribution in a loop.
Here is the code that uses normal_distribution in this strange way:
std::default_random_engine generator;
//std::normal_distribution<double> distribution(5.0,2.0);
for (int i=0; i<nrolls; ++i) {
std::normal_distribution<double> distribution(5.0,2.0);
float x = distribution(generator);
}
Putting the normal_distribution object outside the loop may be slightly more efficient than putting it in the loop. When it's inside the loop, the normal_distribution object may be re-constructed every time, whereas if it's outside the loop it's only constructed once.
Comparison of the assembly.
Based on an analysis of the assembly, declaring distribution outside the loop is more efficient.
Let's look at two different functions, along with the corresponding assembly. One of them declares distribution inside the loop, and the other one declares it outside the loop. To simplify the analysis, they're declared const in both cases, so we (and the compiler) know that the distribution doesn't get modified.
You can see the complete assembly here.
// This function is here to prevent the compiler from optimizing out the
// loop entirely
void doSomething(std::normal_distribution<double> const& d) noexcept;
void inside_loop(double mean, double sd, int n) {
for(int i = 0; i < n; i++) {
const std::normal_distribution<double> d(mean, sd);
doSomething(d);
}
}
void outside_loop(double mean, double sd, int n) {
const std::normal_distribution<double> d(mean, sd);
for(int i = 0; i < n; i++) {
doSomething(d);
}
}
inside_loop assembly
The assembly for the loop looks like this (compiled with gcc 8.3 at O3 optimization).
.L3:
movapd xmm2, XMMWORD PTR [rsp]
lea rdi, [rsp+16]
add ebx, 1
mov BYTE PTR [rsp+40], 0
movaps XMMWORD PTR [rsp+16], xmm2
call foo(std::normal_distribution<double> const&)
cmp ebp, ebx
jne .L3
Basically, it
- constructs the distribution
- invokes foo (the function named doSomething in the source above) with the distribution
- tests to see if it should exit the loop
outside_loop assembly
Using the same compilation options, outside_loop just calls foo repeatedly without re-constructing the distribution. There are fewer instructions, and the loop no longer copies the distribution's parameters around on the stack; it only passes the address of the already-constructed object (mov rdi, rsp).
.L12:
mov rdi, rsp
add ebx, 1
call foo(std::normal_distribution<double> const&)
cmp ebp, ebx
jne .L12
Are there ever any reasons to declare variables inside a loop?
Yes. There are definitely good times to declare variables inside a loop. If you were modifying distribution somehow inside the loop, then it would make sense to reset it every time just by constructing it again.
Furthermore, if you don't ever use a variable outside of a loop, it makes sense to declare it inside the loop just for the purposes of readability.
Types that fit inside a CPU's registers (so floats, ints, doubles, and small user-defined types) oftentimes have no overhead associated with their construction, and declaring them inside a loop can actually lead to better assembly by simplifying compiler analysis of register allocation.
Looking at the interface of the normal distribution, there is a member called reset, which:
resets the internal state of the distribution
This implies that the distribution may have an internal state. If it does, then you definitely reset that when you recreate the object at each iteration. Not using it as intended may produce a distribution which is not normal or might be just inefficient.
What state could it be? That is certainly implementation defined. Looking at one implementation from LLVM, the normal distribution is defined around here. More specifically, the operator() is here. Looking at the code, there is certainly some state shared between subsequent calls. More specifically, at each subsequent call, the state of the boolean variable _V_hot_ is flipped. If it is true, significantly fewer computations are performed and the stored value _V_ is used. If it is false, then _V_ is computed from scratch.
I did not look very deeply into why they chose to do this. But, looking only at the computations performed, it should be much faster to rely on the internal state. While this is only one implementation, it shows that the standard allows the use of internal state, and that in some cases it is beneficial.
Later edit:
The GCC libstdc++ implementation of std::normal_distribution can be found here. Note that the operator() calls another function, __generate_impl, which is defined in a separate file here. While different, this implementation has the same flag, here named _M_saved_available that speeds up every other call.
I left out the rest of the implementation for simplicity because it is not relevant here.
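To make the recommendation concrete, here is a minimal sketch that keeps one distribution object alive across the whole loop, plus a variant that uses reset() when a draw must not depend on a cached value. sample_mean and draw_fresh are hypothetical helper names, and std::mt19937 is used instead of default_random_engine only so that the engine type is the same everywhere.

```cpp
#include <cassert>
#include <random>

// Draws n samples from N(mean, sd) with a single distribution object that is
// constructed once, outside the loop, so any cached state can be reused.
double sample_mean(double mean, double sd, int n, unsigned seed) {
    std::mt19937 generator(seed);
    std::normal_distribution<double> distribution(mean, sd);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += distribution(generator);
    }
    return sum / n;
}

// If a draw must depend only on the engine's current state, reset() discards
// any cached value without constructing a new distribution object.
double draw_fresh(std::normal_distribution<double>& d, std::mt19937& g) {
    d.reset();
    return d(g);
}
```

With enough samples the mean returned by sample_mean(5.0, 2.0, n, seed) should land close to 5.0 whichever implementation of the standard library is used; the exact sample values are implementation specific.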
Consider the classical implementation of double-checked locking described in Modern C++ Design.
Singleton& Singleton::Instance()
{
if(!pInstance_)
{
Guard myGuard(lock_);
if (!pInstance_)
{
pInstance_ = new Singleton;
}
}
return *pInstance_;
}
Here the author insists that we avoid the race condition. But I have read an article, which unfortunately I don't remember very well, in which the following flow was described.
Thread 1 enters the first if statement.
Thread 1 acquires the mutex and enters the second if body.
Thread 1 calls operator new, assigns the memory to pInstance_, and then calls the constructor on that memory.
Suppose thread 1 has assigned the memory to pInstance_ but has not yet constructed the object, and thread 2 enters the function.
Thread 2 sees that pInstance_ is not null (but not yet initialized by the constructor) and returns pInstance_.
In that article the author stated that the trick is that on the line pInstance_ = new Singleton; the memory can be allocated and assigned to pInstance_ before the constructor is called on that memory.
Relying on the standard or other reliable sources, can anyone please confirm or deny the probability or correctness of this flow? Thanks!
The problem you describe can only occur if, for reasons I cannot imagine, the designers of the singleton use an explicit (and broken) two-step construction:
...
Guard myGuard(lock_);
if (!pInstance_)
{
auto alloc = std::allocator<Singleton>();
pInstance_ = alloc.allocate(1); // SHAME here: race condition
// possibly other stuff
alloc.construct(pInstance_); // anything could have happened since allocation
}
....
Even if for any reason such a two-step construction were required, the pInstance_ member shall never contain anything other than nullptr or a fully constructed instance:
auto alloc = std::allocator<Singleton>();
Singleton *tmp = alloc.allocate(1); // no problem here
// possibly other stuff
alloc.construct(tmp); // nor here
pInstance_ = tmp; // a fully constructed instance
But beware: the fix is only guaranteed on a single-CPU system. Things could be much worse on multi-core systems, where the C++11 atomics semantics are indeed required.
The problem is that in the absence of guarantees otherwise, the store of the pointer into pInstance_ might be seen by some other thread before the construction of the object is complete. In that case, the other thread won't enter the mutex and will just immediately return pInstance_ and when the caller uses it, it can see uninitialized values.
This apparent reordering between the store(s) associated with the construction of Singleton and the store to pInstance_ may be caused by the compiler or the hardware. I'll take a quick look at both cases below.
Compiler Reordering
Absent any specific guarantees related to concurrent reads (such as those offered by C++11's std::atomic objects), the compiler only needs to preserve the semantics of the code as seen by the current thread. That means, for example, that it may compile code "out of order" relative to how it appears in the source, as long as this doesn't have visible side-effects (as defined by the standard) on the current thread.
In particular, it would not be uncommon for the compiler to reorder stores performed in the constructor for Singleton with the store to pInstance_, as long as it can see that the effect is the same.
Let's take a look at a fleshed out version of your example:
struct Lock {};
struct Guard {
Guard(Lock& l);
};
int value;
struct Singleton {
int x;
Singleton() : x{value} {}
static Lock lock_;
static Singleton* pInstance_;
static Singleton& Instance();
};
Singleton& Singleton::Instance()
{
if(!pInstance_)
{
Guard myGuard(lock_);
if (!pInstance_)
{
pInstance_ = new Singleton;
}
}
return *pInstance_;
}
Here, the constructor for Singleton is very simple: it reads the global value and assigns it to x, the only member of Singleton.
Using godbolt, we can check exactly how gcc and clang compile this. The gcc version, annotated, is shown below:
Singleton::Instance():
mov rax, QWORD PTR Singleton::pInstance_[rip]
test rax, rax
jz .L9 ; if pInstance_ == NULL, go to .L9
ret
.L9:
sub rsp, 24
mov esi, OFFSET FLAT:_ZN9Singleton5lock_E
lea rdi, [rsp+15]
call Guard::Guard(Lock&) ; acquire the mutex
mov rax, QWORD PTR Singleton::pInstance_[rip]
test rax, rax
jz .L10 ; second check for null, if still null goto L10
.L1:
add rsp, 24
ret
.L10:
mov edi, 4
call operator new(unsigned long) ; allocate memory (pointer in rax)
mov edx, DWORD value[rip] ; load value global
mov QWORD pInstance_[rip], rax ; store pInstance pointer!!
mov DWORD [rax], edx ; store value into pInstance_->x
jmp .L1
The last few lines are critical, in particular the two stores:
mov QWORD pInstance_[rip], rax ; store pInstance pointer!!
mov DWORD [rax], edx ; store value into pInstance_->x
Effectively, the line pInstance_ = new Singleton; has been transformed into:
Singleton* stemp = operator new(sizeof(Singleton)); // (1) allocate uninitialized memory for a Singleton object on the heap
int vtemp = value; // (2) read global variable value
pInstance_ = stemp; // (3) write the pointer, still uninitialized, into the global pInstance_ (oops!)
pInstance_->x = vtemp; // (4) initialize the Singleton by writing x
Oops! Any second thread arriving when (3) has occurred, but (4) hasn't, will see a non-null pInstance_, but then read an uninitialized (garbage) value for pInstance_->x.
So even without invoking any weird hardware reordering at all, this pattern isn't safe without doing more work.
Hardware Reordering
Let's say you arrange things so that the reordering of the stores above doesn't occur on your compiler, perhaps by putting in a compiler barrier such as asm volatile ("" ::: "memory"). With that small change, gcc now compiles this to have the two critical stores in the "desired" order:
mov DWORD PTR [rax], edx
mov QWORD PTR Singleton::pInstance_[rip], rax
So we're good, right?
Well on x86, we are. It happens that x86 has a relatively strong memory model, and all stores already have release semantics. I won't describe the full semantics, but in the context of two stores as above, it implies that stores appear in program order to other CPUs: so any CPU that sees the second write above (to pInstance_) will necessarily see the prior write (to pInstance_->x).
We can illustrate that by using the C++11 std::atomic feature to explicitly ask for a release store for pInstance_ (this also enables us to get rid of the compiler barrier):
static std::atomic<Singleton*> pInstance_;
...
if (!pInstance_)
{
pInstance_.store(new Singleton, std::memory_order_release);
}
We get reasonable assembly with no hardware memory barriers or anything (there is a redundant load now, but this is both a missed-optimization by gcc and a consequence of the way we wrote the function).
So we're done, right?
Nope - most other platforms don't have the strong store-to-store ordering that x86 does.
Let's take a look at ARM64 assembly around the creation of the new object:
bl operator new(unsigned long)
mov x1, x0 ; x1 holds Singleton* temp
adrp x0, .LANCHOR0
ldr w0, [x0, #:lo12:.LANCHOR0] ; load value
str w0, [x1] ; temp->x = value
mov x0, x1
str x1, [x19, #pInstance_] ; pInstance_ = temp
So we have the str to pInstance_ as the last store, coming after the temp->x = value store, as we want. However, the ARM64 memory model doesn't guarantee that these stores appear in program order when observed by another CPU. So even though we've tamed the compiler, the hardware can still trip us up. You'll need a barrier to solve this.
Prior to C++11, there wasn't a portable solution for this problem. For a particular ISA you could use inline assembly to emit the right barrier. Your compiler might have a builtin like __sync_synchronize offered by gcc, or your OS might even have something.
In C++11 and beyond, however, we finally have a formal memory model built into the language, and what we need for double-checked locking is a release store as the final store to pInstance_. We saw this already for x86, where we checked that the compiler emitted no barrier. On ARM64, using std::atomic with memory_order_release, the object publishing code becomes:
bl operator new(unsigned long)
adrp x1, .LANCHOR0
ldr w1, [x1, #:lo12:.LANCHOR0]
str w1, [x0]
stlr x0, [x20]
The main difference is the final store is now stlr - a release store. You can check out the PowerPC side too, where an lwsync barrier has popped up between the two stores.
So the bottom line is that:
Double checked locking is safe in a sequentially consistent system.
Real-world systems almost always deviate from sequential consistency, either due to the hardware, the compiler or both.
To solve that, you need to tell the compiler what you want, and it will both avoid reordering itself and emit the necessary barrier instructions, if any, to prevent the hardware from causing a problem.
Prior to C++11, the "way you tell the compiler" to do that was platform/compiler/OS specific, but in C++11 and later you can simply use std::atomic with memory_order_acquire loads and memory_order_release stores.
The Load
The above only covered half of the problem: the store of pInstance_. The other half that can go wrong is the load, and the load is actually the most important for performance, since it represents the usual fast path that gets taken after the singleton is initialized. What if pInstance_->x was loaded before pInstance_ itself was loaded and checked for null? In that case, you could still read an uninitialized value!
This might seem unlikely, since pInstance_ needs to be loaded before it is dereferenced, right? That is, there seems to be a fundamental dependency between the operations that prevents reordering, unlike the store case. Well, as it turns out, both hardware behavior and software transformation could still trip you up here, and the details are even more complex than the store case. If you use memory_order_acquire though, you'll be fine. If you want the last ounce of performance, especially on PowerPC, you'll need to dig into the details of memory_order_consume. A tale for another day.
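Putting the store and the load halves together, a minimal sketch of double-checked locking with C++11 atomics might look like the following. The mutex, the instance() accessor, and the constructor that sets x to 42 are assumptions made for illustration; only pInstance_, x, and the acquire/release orderings come from the discussion above.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class Singleton {
public:
    int x;

    static Singleton& instance() {
        // Fast path: the acquire load pairs with the release store below,
        // so a thread that sees a non-null pointer also sees the write to x.
        Singleton* p = pInstance_.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(mutex_);
            // Second check, now under the lock. A relaxed load is enough
            // here because the mutex already orders the competing writers.
            p = pInstance_.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Singleton;  // fully construct the object first
                // Publish: the release store keeps the write to x from
                // becoming visible after the pointer does.
                pInstance_.store(p, std::memory_order_release);
            }
        }
        assert(p != nullptr);
        return *p;
    }

private:
    Singleton() : x(42) {}  // hypothetical initial value
    static std::atomic<Singleton*> pInstance_;
    static std::mutex mutex_;
};

std::atomic<Singleton*> Singleton::pInstance_{nullptr};
std::mutex Singleton::mutex_;
```

The instance is intentionally never deleted in this sketch. After the first call, every later call takes only the acquire load on the fast path and never touches the mutex.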
1 In particular, this means that the compiler has to be able to see the code for the constructor Singleton() so that it can determine that it doesn't read from pInstance_.
2 Of course, it's very dangerous to rely on this since you'd have to check the assembly after every compilation if anything changed!
It used to be unspecified before C++11, because there was no standard memory model discussing multiple threads.
IIRC the pointer could have been set to the allocated address before the constructor completed so long as that thread would never be able to tell the difference (this could probably only happen for a trivial/non-throwing constructor).
Since C++11, the sequenced-before rules disallow that reordering, specifically
8) The side effect (modification of the left argument) of the built-in assignment operator ... is sequenced after the value computation ... of both left and right arguments, ...
Since the right argument is a new-expression, that must have completed allocation & construction before the left-hand-side can be modified.
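One way to observe this ordering within a single thread is a constructor that throws: because the assignment is sequenced after the new-expression, the pointer is never modified. The sketch below uses a hypothetical Widget type. It only illustrates the single-threaded sequencing rule and says nothing about what another thread may observe without atomics.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical type whose constructor can be told to fail.
struct Widget {
    explicit Widget(bool fail) {
        if (fail) {
            throw std::runtime_error("construction failed");
        }
    }
};

// Returns true if the pointer was left untouched when construction threw.
bool pointer_untouched_when_constructor_throws() {
    // Sanity check: a successful new-expression yields a usable pointer.
    Widget* ok = new Widget(false);
    assert(ok != nullptr);
    delete ok;

    Widget* p = nullptr;
    try {
        // The new-expression must finish (allocate and construct) before
        // the assignment happens. Here it throws, so p is never written.
        p = new Widget(true);
    } catch (const std::runtime_error&) {
        // The storage obtained by operator new has already been released.
    }
    return p == nullptr;
}
```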
Say I were to allocate 2 memory blocks.
I use the first memory block to store something and then use this stored data.
Then I use the second memory block to do something similar.
{
int a[10];
int b[10];
setup_0(a);
use_0(a);
setup_1(b);
use_1(b);
}
|| compiler optimizes this to this?
\/
{
int a[10];
setup_0(a);
use_0(a);
setup_1(a);
use_1(a);
}
// the setup functions overwrites all 10 words
The question is now: Do compilers optimize this, so that they reuse the existing memory block instead of allocating a second one, if the compiler knows that the first block will not be referenced again?
If this is true:
Does this also work with dynamic memory allocation?
Is this also possible if the memory persists outside the scope, but is used in the same way as given in the example?
I assume this only works if setup and foo are implemented in the same c file (exist in the same object as the calling code)?
Do compilers optimize this
This question can only be answered if you ask about a particular compiler. And the answer can be found by inspecting the generated code.
so that they reuse the existing memory blocks, instead of allocating a second one, if the compiler knows that the first block will not be referenced again?
Such optimization would not change the behaviour of the program, so it would be allowed. Another matter is: Is it possible to prove that the memory will not be referenced? If it is possible, then is it easy enough to prove in reasonable time? I feel very safe in saying that it is not possible to prove in general, but it is provable in some cases.
I assume this only works if setup and foo are implemented in the same c file (exist in the same object as the calling code)?
That would usually be required to prove the untouchability of the memory. Link time optimization might lift this requirement, in theory.
Does this also work with dynamic memory allocation?
In theory, since it doesn't change the behaviour of the program. However, the dynamic memory allocation is typically performed by a library and thus the compiler may not be able to prove the lack of side-effects and therefore wouldn't be able to prove that removing an allocation wouldn't change behaviour.
Is this also possible if the memory persists outside the scope, but is used in the same way as given in the example?
If the compiler is able to prove that the memory is leaked, then perhaps.
Even though the optimization may be possible, it is not very significant. Saving a bit of stack space probably has very little effect on run time. It could be useful to prevent stack overflows if the arrays are large.
https://godbolt.org/g/5nDqoC
#include <cstdlib>
extern int a;
extern int b;
int main()
{
{
int tab[1];
tab[0] = 42;
a = tab[0];
}
{
int tab[1];
tab[0] = 42;
b = tab[0];
}
return 0;
}
Compiled with gcc 7 with -O3 compilation flag:
main:
mov DWORD PTR a[rip], 42
mov DWORD PTR b[rip], 42
xor eax, eax
ret
If you follow the link you should see the code being compiled on gcc and clang with the -O3 optimisation level. The resulting asm code is pretty straightforward. As the value stored in the array is known at compilation time, the compiler can easily skip everything and directly set the variables a and b. Your buffer is not needed.
Following a code similar to the one provided in your example:
https://godbolt.org/g/bZHSE4
#include <cstdlib>
int func1(const int (&tab)[10]);
int func2(const int (&tab)[10]);
int main()
{
int a[10];
int b[10];
func1(a);
func2(b);
return 0;
}
Compiled with gcc 7 with -O3 compilation flag:
main:
sub rsp, 104
mov rdi, rsp ; first address is rsp
call func1(int const (&) [10])
lea rdi, [rsp+48] ; second address is [rsp+48]
call func2(int const (&) [10])
xor eax, eax
add rsp, 104
ret
You can see the pointer sent to the function func1 and func2 is different as the first pointer used is rsp in the call to func1, and [rsp+48] in the call to func2.
So either the compiler eliminates the buffers entirely, in the case where their contents are predictable, or, in the other case, at least for gcc 7 and clang 3.9.1, the two buffers are not merged.
https://godbolt.org/g/TnV62V
#include <cstdlib>
extern int * a;
extern int * b;
inline int do_stuff(int ** to)
{
*to = (int *) malloc(sizeof(int));
(**to) = 42;
return **to;
}
int main()
{
do_stuff(&a);
free(a);
do_stuff(&b);
free(b);
return 0;
}
Compiled with gcc 7 with -O3 compilation flag:
main:
sub rsp, 8
mov edi, 4
call malloc
mov rdi, rax
mov QWORD PTR a[rip], rax
call free
mov edi, 4
call malloc
mov rdi, rax
mov QWORD PTR b[rip], rax
call free
xor eax, eax
add rsp, 8
ret
Even without being fluent at reading this, it is pretty easy to tell that in the preceding example the calls to malloc and free are not optimized away by either gcc or clang (if you want to try more compilers, feel free, but don't forget to set the optimization flag).
You can clearly see a call to "malloc" followed by a call to "free", twice.
Optimizing stack space is quite unlikely to really have an effect on the speed of your program, unless you manipulate large amounts of data.
Optimizing dynamically allocated memory is more relevant. AFAIK you will have to use a third-party library or run your own system if you plan to do that and this is not a trivial task.
EDIT: Forgot to mention the obvious, this is very compiler dependent.
As the compiler sees that a is used as a parameter for a function, it will not optimize b away. It can't, because it doesn't know what happens in the function that uses a and b. Same for a: the compiler doesn't know that a isn't used anymore.
As far as the compiler is concerned, the address of a could, for example, have been stored by setup_0 in a global variable and then be used by setup_1 when it is called with b.
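To make that scenario concrete, here is a small sketch in which setup_0 stashes the address of its argument in a global and use_1 later reads through it. The function bodies and the g_saved global are hypothetical, written only to show that the two-array and one-array versions can produce different results.

```cpp
#include <cassert>

// Hypothetical global in which setup_0 remembers the array it was given.
static const int* g_saved = nullptr;

void setup_0(int (&a)[10]) {
    for (int& v : a) v = 1;
    g_saved = a;  // the address of 'a' escapes here
}

void use_0(const int (&)[10]) {}

void setup_1(int (&b)[10]) {
    for (int& v : b) v = 2;
}

// Combines the remembered array with the one passed in.
int use_1(const int (&b)[10]) {
    assert(g_saved != nullptr);
    int sum = 0;
    for (int i = 0; i < 10; ++i) sum += g_saved[i] + b[i];
    return sum;
}

// The code as written in the question: two distinct arrays.
int run_separate() {
    int a[10];
    int b[10];
    setup_0(a);
    use_0(a);
    setup_1(b);
    return use_1(b);  // 10 * (1 + 2)
}

// The proposed "optimized" form: one array reused for both phases.
int run_shared() {
    int a[10];
    setup_0(a);
    use_0(a);
    setup_1(a);       // overwrites the data that g_saved still points to
    return use_1(a);  // 10 * (2 + 2)
}
```

Here run_separate() returns 30 and run_shared() returns 40, so a compiler that cannot see the function bodies has to assume the rewrite may change observable behaviour.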
The short answer is: No! The compiler cannot optimize this code to what you suggested, because it is not semantically equivalent.
Long explanation: The lifetime of a and b is, with some simplification, the complete block.
So now let's assume that one of setup_0 or use_0 stores a pointer to a in some global variable. Now setup_1 and use_1 are allowed to use a via this global variable in combination with b (for example, they can add the array elements of a and b). If the transformation you suggested were applied to the code, this would change the behaviour of the program. If you really want to make a statement about the lifetime, you have to write the code in the following way:
{
{ // Lifetime block for a
char a[100];
setup_0(a);
use_0(a);
} // Lifetime of a ends here, so none of the functions called
// below is allowed to access it. If one does access it by
// accident, the behaviour is undefined
char b[100];
setup_1(b); // Not allowed to access a
use_1(b); // Not allowed to access a
}
Please also note that gcc 12.x and clang 15 both do the optimization. If you comment out the curly brackets, the optimization is (correctly!) not done.
Yes, theoretically, a compiler could optimize the code as you describe, assuming that it could prove that these functions did not modify the arrays passed in as parameters.
But in practice, no, that does not happen. You can write a simple test case to verify this. I've avoided defining the helper functions so the compiler can't inline them, but passed the arrays by const-reference to ensure that the compiler knows the functions don't modify them:
void setup_0(const int (&p)[10]);
void use_0 (const int (&p)[10]);
void setup_1(const int (&p)[10]);
void use_1 (const int (&p)[10]);
void TestFxn()
{
int a[10];
int b[10];
setup_0(a);
use_0(a);
setup_1(b);
use_1(b);
}
As you can see here on Godbolt's Compiler Explorer, no compilers (GCC, Clang, ICC, nor MSVC) will optimize this to use a single stack-allocated array of 10 elements. Of course, each compiler varies in how much space it allocates on the stack. Some of that is due to different calling conventions, which may or may not require a red zone. Otherwise, it's due to the optimizer's alignment preferences.
Taking GCC's output as an example, you can immediately tell that it is not reusing the array a. The following is the disassembly, with my annotations:
; Allocate 104 bytes on the stack
; by subtracting from the stack pointer, RSP.
; (The stack always grows downward on x86.)
sub rsp, 104
; Place the address of the top of the stack in RDI,
; which is how the array is passed to setup_0().
mov rdi, rsp
call setup_0(int const (&) [10])
; Since setup_0() may have clobbered the value in RDI,
; "refresh" it with the address at the top of the stack,
; and call use_0().
mov rdi, rsp
call use_0(int const (&) [10])
; We are now finished with array 'a', so add 48 bytes
; to the top of the stack (RSP), and place the result
; in the RDI register.
lea rdi, [rsp+48]
; Now, RDI contains what is effectively the address of
; array 'b', so call setup_1().
; The parameter is passed in RDI, just like before.
call setup_1(int const (&) [10])
; Second verse, same as the first: "refresh" the address
; of array 'b' in RDI, since it might have been clobbered,
; and pass it to use_1().
lea rdi, [rsp+48]
call use_1(int const (&) [10])
; Clean up the stack by adding 104 bytes to compensate for the
; same 104 bytes that we subtracted at the top of the function.
add rsp, 104
ret
So, what gives? Are compilers just massively missing the boat here when it comes to an important optimization? No. Allocating space on the stack is extremely fast and cheap. There would be very little benefit in allocating ~50 bytes, as opposed to ~100 bytes. Might as well just play it safe and allocate enough space for both arrays separately.
There might be more of a benefit in reusing the stack space for the second array if both arrays were extremely large, but empirically, compilers don't do this, either.
Does this work with dynamic memory allocation? No. Emphatically no. I've never seen a compiler that optimizes around dynamic memory allocation like this, and I don't expect to see one. It just doesn't make sense. If you wanted to re-use the block of memory, you would have written the code to re-use it instead of allocating a separate block.
I suppose you are thinking that if you had something like the following C code:
void TestFxn()
{
int* a = malloc(sizeof(int) * 10);
setup_0(a);
use_0(a);
free(a);
int* b = malloc(sizeof(int) * 10);
setup_1(b);
use_1(b);
free(b);
}
that the optimizer could see that you were freeing a, and then immediately re-allocating a block of the same size as b? Well, the optimizer won't recognize this and elide the back-to-back calls to free and malloc, but the run-time library (and/or operating system) very likely will. free is a very cheap operation, and since a block of the appropriate size was just released, allocation will also be very cheap. (Most run-time libraries maintain a private heap for the application and won't even return the memory to the operating system, so depending on the memory-allocation strategy, it's even possible that you get the exact same block back.)
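As a sketch of what "writing the code to re-use it" could look like, the version below allocates one block and runs both phases on it. The bodies of the setup and use functions, the Results struct, and run_with_one_block are hypothetical and exist only to make the example runnable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

constexpr std::size_t kCount = 10;

// Hypothetical phase 1: fill with 0..9 and sum.
void setup_0(int* p) {
    assert(p != nullptr);
    for (std::size_t i = 0; i < kCount; ++i) p[i] = static_cast<int>(i);
}
int use_0(const int* p) {
    int sum = 0;
    for (std::size_t i = 0; i < kCount; ++i) sum += p[i];
    return sum;
}

// Hypothetical phase 2: overwrite all elements with 0,2,..,18 and sum.
void setup_1(int* p) {
    assert(p != nullptr);
    for (std::size_t i = 0; i < kCount; ++i) p[i] = static_cast<int>(2 * i);
}
int use_1(const int* p) {
    int sum = 0;
    for (std::size_t i = 0; i < kCount; ++i) sum += p[i];
    return sum;
}

struct Results {
    int first;
    int second;
};

// One allocation serves both phases; the reuse is explicit in the source
// rather than something the optimizer has to discover.
Results run_with_one_block() {
    int* block = static_cast<int*>(std::malloc(sizeof(int) * kCount));
    if (block == nullptr) {
        return Results{-1, -1};  // allocation failed
    }
    setup_0(block);
    const int first = use_0(block);
    setup_1(block);  // overwrites every element, so no stale data remains
    const int second = use_1(block);
    std::free(block);
    return Results{first, second};
}
```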