I'm wondering if there could be a problem with putting normal_distribution in a loop.
Here is the code that uses normal_distribution in this strange way:
std::default_random_engine generator;
//std::normal_distribution<double> distribution(5.0, 2.0);
for (int i = 0; i < nrolls; ++i) {
    std::normal_distribution<double> distribution(5.0, 2.0);
    double x = distribution(generator); // double, matching the distribution's result type
}
Putting the normal_distribution object outside the loop is likely to be slightly more efficient than putting it inside. When it's inside the loop, the object is constructed on every iteration, whereas outside the loop it's constructed only once.
Comparison of the assembly
Based on an analysis of the assembly, declaring distribution outside the loop is more efficient.
Let's look at two different functions, along with the corresponding assembly. One of them declares distribution inside the loop, and the other one declares it outside the loop. To simplify the analysis, they're declared const in both cases, so we (and the compiler) know that the distribution doesn't get modified.
You can see the complete assembly here.
// This function is here to prevent the compiler from optimizing out the
// loop entirely
void foo(std::normal_distribution<double> const& d) noexcept;

void inside_loop(double mean, double sd, int n) {
    for (int i = 0; i < n; i++) {
        const std::normal_distribution<double> d(mean, sd);
        foo(d);
    }
}

void outside_loop(double mean, double sd, int n) {
    const std::normal_distribution<double> d(mean, sd);
    for (int i = 0; i < n; i++) {
        foo(d);
    }
}
inside_loop assembly
The assembly for the loop looks like this (compiled with gcc 8.3 at -O3).
.L3:
movapd xmm2, XMMWORD PTR [rsp]
lea rdi, [rsp+16]
add ebx, 1
mov BYTE PTR [rsp+40], 0
movaps XMMWORD PTR [rsp+16], xmm2
call foo(std::normal_distribution<double> const&)
cmp ebp, ebx
jne .L3
Basically, on each iteration it:
- constructs the distribution (copying the parameters onto the stack and zeroing what is likely the distribution's saved-value flag)
- invokes foo with the distribution
- tests whether it should exit the loop
outside_loop assembly
Using the same compilation options, outside_loop just calls foo repeatedly without re-constructing the distribution. There are fewer instructions, and the loop itself never writes to the stack (the distribution was placed there once, before the loop).
.L12:
mov rdi, rsp
add ebx, 1
call foo(std::normal_distribution<double> const&)
cmp ebp, ebx
jne .L12
Are there ever any reasons to declare variables inside a loop?
Yes. There are definitely good times to declare variables inside a loop. If you were modifying distribution somehow inside the loop, re-constructing it at the top of each iteration would be a simple way to reset it.
Furthermore, if a variable is never used outside the loop, declaring it inside the loop keeps its scope as small as possible, which helps readability.
Types that fit inside a CPU's registers (floats, ints, doubles, and small user-defined types) often have no construction overhead at all, and declaring them inside a loop can actually lead to better assembly by simplifying the compiler's register-allocation analysis.
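For instance, a minimal sketch (hypothetical function) where the temporary declared inside the loop lives entirely in a register, so the tight scope costs nothing:
double sum_squares(const double* v, int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        const double sq = v[i] * v[i]; // register-resident temporary; no construction cost
        total += sq;
    }
    return total;
}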
Looking at the interface of std::normal_distribution, there is a member function called reset, which:
resets the internal state of the distribution
This implies that the distribution may carry internal state between calls. If it does, re-creating the object at each iteration throws that state away. Each freshly constructed object still produces correctly distributed values, so the results remain normal, but discarding the state can make the code needlessly slow.
What state could that be? That is certainly implementation-defined. Looking at one implementation from LLVM, the normal distribution is defined around here. More specifically, the operator() is here. Looking at the code, there is certainly some state shared between subsequent calls. More specifically, at each call the boolean member _V_hot_ is flipped. If it is true, significantly fewer computations are performed and the stored value _V_ is used. If it is false, then _V_ is computed from scratch.
I did not look very deeply into why they chose to do this, but looking only at the computations performed, it should be much faster to rely on the internal state. While this is only one implementation, it shows that the standard allows the use of internal state, and in some cases it is beneficial.
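The likely motivation is that the underlying algorithm (the Marsaglia polar method, in libc++'s case) produces two independent normal variates per round, so one can be cached for the next call. A minimal sketch of the idea, using std::rand for brevity; this is not libc++'s actual code:
#include <cmath>
#include <cstdlib>

// Simplified polar method with a cached second variate, mirroring
// the _V_ / _V_hot_ idea from the libc++ source.
double cached_normal() {
    static double v;           // the cached second variate ("_V_")
    static bool v_hot = false; // is a cached value available? ("_V_hot_")
    if (v_hot) {
        v_hot = false;
        return v;              // cheap path: reuse the stored value
    }
    double u1, u2, s;
    do {                       // rejection-sample a point inside the unit disc
        u1 = 2.0 * std::rand() / RAND_MAX - 1.0;
        u2 = 2.0 * std::rand() / RAND_MAX - 1.0;
        s = u1 * u1 + u2 * u2;
    } while (s >= 1.0 || s == 0.0);
    const double factor = std::sqrt(-2.0 * std::log(s) / s);
    v = u2 * factor;           // cache one variate for the next call
    v_hot = true;
    return u1 * factor;        // return the other immediately
}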
Later edit:
The GCC libstdc++ implementation of std::normal_distribution can be found here. Note that operator() calls another function, __generate_impl, which is defined in a separate file here. While different, this implementation has the same kind of flag, here named _M_saved_available, which speeds up every other call.
Related
I'm curious about the underlying implementation of static variables within a function.
If I declare a static variable of a fundamental type (char, int, double, etc.), and give it an initial value, I imagine that the compiler simply sets the value of that variable at the very beginning of the program before main() is called:
void SomeFunction();
int main(int argCount, char ** argList)
{
// at this point, the memory reserved for 'answer'
// already contains the value of 42
SomeFunction();
}
void SomeFunction()
{
static int answer = 42;
}
However, if the static variable is an instance of a class:
class MyClass
{
//...
};
void SomeFunction();
int main(int argCount, char ** argList)
{
SomeFunction();
}
void SomeFunction()
{
static MyClass myVar;
}
I know that it will not be initialized until the first time that the function is called. Since the compiler has no way of knowing when the function will be called for the first time, how does it produce this behavior? Does it essentially introduce an if-block into the function body?
static bool initialized = 0;
if (!initialized)
{
// construct myVar
initialized = 1;
}
This question covered similar ground, but thread safety wasn't mentioned. For what it's worth, C++0x will make function static initialisation thread safe.
(see the C++0x FCD, 6.7/4 on function statics: "If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.")
One other thing that hasn't been mentioned is that function statics are destructed in reverse order of their construction, so the compiler maintains a list of destructors to call on shutdown (this may or may not be the same list that atexit uses).
In the compiler output I have seen, function local static variables are initialized exactly as you imagine.
(Caveat: This paragraph applies to C++ versions older than C++11. See the comments for changes since C++11.) Note that in general this is not done in a thread-safe manner. So if you have functions with static locals like that that might be called from multiple threads, you should take this into account. Calling the function once in the main thread before any others are called will usually do the trick.
I should add that if the initialization of the local static is by a simple constant like in your example, the compiler doesn't need to go through these gyrations - it can just initialize the variable in the image or before main() like a regular static initialization (because your program wouldn't be able to tell the difference). But if you initialize it with a function's return value, then the compiler pretty much has to test a flag indicating if the initialization has been done or something equivalent.
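For illustration, a minimal sketch of the two cases (hypothetical names):
int compute(); // result not known at compile time

void constant_init() {
    static int a = 42;        // can sit pre-initialized in the image; no flag needed
}

void dynamic_init() {
    static int b = compute(); // needs a hidden "initialized?" flag tested on each call
}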
You're right about everything, including the initialized flag as a common implementation. This is basically why initialization of static locals is not thread-safe, and why pthread_once exists.
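For reference, a minimal sketch of the pthread_once idiom mentioned above (POSIX; hypothetical names):
#include <pthread.h>

static pthread_once_t once_control = PTHREAD_ONCE_INIT;
static int value;

static void init_value() { value = 42; } // runs at most once across all threads

int get_value() {
    // every caller passes through here; pthread_once guarantees exactly one
    // thread runs init_value, and the others wait until it has completed
    pthread_once(&once_control, init_value);
    return value;
}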
One slight caveat: the compiler must emit code which "behaves as if" the static local variable is constructed the first time it is used. Since integer initialization has no side effects (and calls no user code), it's up to the compiler when it initializes the int. User code cannot "legitimately" find out what it does.
Obviously you can look at the assembly code, or provoke undefined behaviour and make deductions from what actually happens. But the C++ standard doesn't count that as valid grounds to claim that the behaviour is not "as if" it did what the spec says.
I know that it will not be initialized until the first time that the function is called. Since the compiler has no way of knowing when the function will be called for the first time, how does it produce this behavior? Does it essentially introduce an if-block into the function body?
Yes, that's right: and, FWIW, it's not necessarily thread-safe (if the function is called "for the first time" by two threads simultaneously).
For that reason you might prefer to define the variable at global scope (although maybe in a class or namespace, or static without external linkage) instead of inside a function, so that it's initialized before the program starts without any run-time "if".
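For instance, a minimal sketch of that alternative, reusing the question's MyClass (the unnamed namespace gives internal linkage, like static):
class MyClass { /* ... */ };

namespace {
    MyClass myVar; // dynamically initialized during startup, before main()
}

void SomeFunction() {
    // use myVar directly; no per-call "already initialized?" test
}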
Another twist is in embedded code, where the run-before-main() code (cinit/whatever) may copy pre-initialized data (both statics and non-statics) into RAM from a const data segment, perhaps residing in ROM. This is useful where the code may not be running from some sort of backing store (disk) from which it can be re-loaded. Again, this doesn't violate the requirements of the language, since this is done before main().
Slight tangent: While I've not seen it done much (outside of Emacs), a program or compiler could basically run your code in a process and instantiate/initialize objects, then freeze and dump the process. Emacs does something similar to this to load up large amounts of elisp (i.e. chew on it), then dump the running state as the working executable, to avoid the cost of parsing on each invocation.
The relevant thing isn't being a class type or not, it's compile-time evaluation of the initializer (at the current optimization level). And of course the constructor not having any side-effects, if it's non-trivial.
If it's not possible to simply put a constant value in .data, gcc/clang use an acquire load of a guard variable to check that static locals have been initialized. If the guard variable is false, then they pick one thread to do the initializing, and have other threads wait for it if they also see a false guard variable. They've been doing this for a long time, since before C++11 required it. (e.g. as old as GCC4.1 on Godbolt, from May 2006.)
Does a function local static variable automatically incur a branch? shows what GCC does.
Cost of thread-safe local static variable initialization in C++11? same
Why does initialization of local static objects use hidden guard flags? same
The simplest artificial example, snapshotting the arg from the first call and ignoring later args:
int foo(int a){
static int x = a;
return x;
}
Compiles for x86-64 with GCC11.3 -O3 (Godbolt), with the exact same asm generated for -std=gnu++03 mode. GCC4.1 also makes about the same asm, but doesn't keep the push/pop off the fast path (i.e. missing shrink-wrap optimization). GCC4.1 only supported AT&T syntax output, so it visually looks different unless you flip modern GCC to AT&T mode as well, but this is Intel syntax (destination on the left).
# demangled asm from g++ -O3
foo(int):
movzx eax, BYTE PTR guard variable for foo(int)::x[rip] # guard.load(acquire)
test al, al
je .L13
mov eax, DWORD PTR foo(int)::x[rip] # normal load of the static local
ret # fast path through the function is the already-initialized case
.L13: # jumps here on guard == 0, on the first call (and any that race with it)
# It would be sensible for GCC to put this code in .text.cold
push rbx
mov ebx, edi # save function arg in a call-preserved reg
mov edi, OFFSET FLAT:guard variable for foo(int)::x # address
call __cxa_guard_acquire # guard_acquire(&guard_x) presumably a normal mutex or spinlock
test eax, eax
jne .L14 # if (we won the race to do the init work) goto .L14
mov eax, DWORD PTR foo(int)::x[rip] # else it's done now by another thread
pop rbx
ret
.L14:
mov edi, OFFSET FLAT:guard variable for foo(int)::x
mov DWORD PTR foo(int)::x[rip], ebx # init static x (from a, saved in RBX)
call __cxa_guard_release
mov eax, DWORD PTR foo(int)::x[rip] # missed optimization: mov eax, ebx
# This thread is the one that just initialized it, our function arg is the value.
# It's not atomic (or volatile), so another thread can't have set it, too.
pop rbx
ret
If compiling for AArch64, the load of the guard variable is ldarb w8, [x8], a load with acquire semantics. Other ISAs might need a plain load and then a barrier to give at least LoadLoad ordering, to make sure they load the payload x no earlier than when they saw the guard variable being non-zero.
If the static variable has a constant initializer, no guard is needed
int bar(int a){
static int x = 1;
return ++x + a;
}
bar(int):
mov eax, DWORD PTR bar(int)::x[rip]
add eax, 1
mov DWORD PTR bar(int)::x[rip], eax # store the updated value
add eax, edi # and add it to the function arg
ret
.section .data
bar(int)::x:
.long 1
I wrote a simple multithreaded program as follows:
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

static bool finished = false;
int func()
{
size_t i = 0;
while (!finished)
++i;
return i;
}
int main()
{
auto result=std::async(std::launch::async, func);
std::this_thread::sleep_for(std::chrono::seconds(1));
finished=true;
std::cout<<"result ="<<result.get();
std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}
It behaves normally in Debug mode in Visual Studio, or at -O0 in gcc, and prints the result after 1 second. But it gets stuck and does not print anything in Release mode, or at -O1, -O2, or -O3.
Two threads accessing a non-atomic, unguarded variable is undefined behaviour. This concerns finished. You could make finished a std::atomic<bool> to fix this.
My fix:
#include <atomic>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

static std::atomic<bool> finished{false};
int func()
{
size_t i = 0;
while (!finished)
++i;
return i;
}
int main()
{
auto result=std::async(std::launch::async, func);
std::this_thread::sleep_for(std::chrono::seconds(1));
finished=true;
std::cout<<"result ="<<result.get();
std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}
Output:
result =1023045342
main thread id=140147660588864
Live Demo on coliru
Somebody may think 'It's a bool – probably one bit. How can this be non-atomic?' (I did when I started with multi-threading myself.)
But note that lack-of-tearing is not the only thing that std::atomic gives you. It also makes concurrent read+write access from multiple threads well-defined, stopping the compiler from assuming that re-reading the variable will always see the same value.
Using an unguarded, non-atomic bool can cause additional issues:
The compiler might decide to optimize the variable into a register, or even CSE multiple accesses into one and hoist the load out of the loop.
The variable might be cached for a CPU core. (In real life, CPUs have coherent caches. This is not a real problem, but the C++ standard is loose enough to cover hypothetical C++ implementations on non-coherent shared memory where atomic<bool> with memory_order_relaxed store/load would work, but where volatile wouldn't. Using volatile for this would be UB, even though it works in practice on real C++ implementations.)
To prevent this from happening, the compiler must be told explicitly not to perform such optimizations.
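A minimal sketch of that point (hypothetical names, not the question's code): with std::atomic, even memory_order_relaxed forces the compiler to actually re-read the flag on every iteration:
#include <atomic>

std::atomic<bool> stop_flag{false};

int spin_until_stopped() {
    int i = 0;
    // relaxed suffices here: we only need each iteration to perform a real
    // load, not any ordering with respect to other memory operations
    while (!stop_flag.load(std::memory_order_relaxed))
        ++i;
    return i;
}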
I'm a little bit surprised by the evolving discussion concerning the potential relation of volatile to this issue. Thus, I'd like to spend my two cents:
Is volatile useful with threads?
Who's afraid of a big bad optimizing compiler?
Scheff's answer describes how to fix your code. I thought I would add a little information on what is actually happening in this case.
I compiled your code at godbolt using optimisation level 1 (-O1). Your function compiles like so:
func():
cmp BYTE PTR finished[rip], 0
jne .L4
.L5:
jmp .L5
.L4:
mov eax, 0
ret
So, what is happening here?
First, we have a comparison: cmp BYTE PTR finished[rip], 0 - this checks to see if finished is false or not.
If it is not false (i.e. true), we should exit the loop on the first run. This is accomplished by jne .L4, which jumps when not equal to label .L4, where the return value of i (0) is placed in a register and the function returns.
If it is false however, we move to
.L5:
jmp .L5
This is an unconditional jump, to label .L5 which just so happens to be the jump command itself.
In other words, the thread is put into an infinite busy loop.
So why has this happened?
As far as the optimiser is concerned, threads are outside of its purview. It assumes other threads aren't reading or writing variables simultaneously (because that would be data-race UB). You need to tell it that it cannot optimise accesses away. This is where Scheff's answer comes in. I won't bother to repeat him.
Because the optimiser is not told that the finished variable may potentially change during execution of the function, it sees that finished is not modified by the function itself and assumes that it is constant.
The optimised code provides the two code paths that will result from entering the function with a constant bool value; either it runs the loop infinitely, or the loop is never run.
At -O0, the compiler (as expected) does not optimise the loop body and comparison away:
func():
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], 0
.L148:
movzx eax, BYTE PTR finished[rip]
test al, al
jne .L147
add QWORD PTR [rbp-8], 1
jmp .L148
.L147:
mov rax, QWORD PTR [rbp-8]
pop rbp
ret
Therefore the function, when unoptimised, does work. The lack of atomicity here is typically not a problem, because the code and data type are simple. Probably the worst we could run into here is a value of i that is off by one from what it should be.
A more complex system with data-structures is far more likely to result in corrupted data, or improper execution.
For the sake of completeness in the learning curve: you should avoid using global variables. You did a good job, though, by making it static, so it is local to the translation unit.
Here is an example:
#include <atomic>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

class ST {
public:
int func()
{
size_t i = 0;
while (!finished)
++i;
return i;
}
void setFinished(bool val)
{
finished = val;
}
private:
std::atomic<bool> finished{false};
};
int main()
{
ST st;
auto result=std::async(std::launch::async, &ST::func, std::ref(st));
std::this_thread::sleep_for(std::chrono::seconds(1));
st.setFinished(true);
std::cout<<"result ="<<result.get();
std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}
Live on wandbox
Suppose I have a function double F(double x) and let's assume for the sake of this example that calls to F are costly.
Suppose I write a function f that calculates the square root of F:
double f(double x){
return sqrt(F(x));
}
and in a third function sum I calculate the sum of f and F:
double sum(double x){
return F(x) + f(x);
}
Since I want to minimise calls to F, the above code is inefficient compared to, e.g.:
double sum_2(double x){
double y = F(x);
return y + sqrt(y);
}
But since I am lazy, or stupid, or want to make my code as clear as possible, I opted for the first definition instead.
Would a C/C++ compiler optimise my code anyway by realizing that the value of F(x) can be reused to calculate f(x), as it is done in sum_2?
Many thanks.
Would a C/C++ compiler optimise my code anyway by realizing that the value of F(x) can be reused to calculate f(x), as it is done in sum_2?
Maybe. Neither language requires such an optimization, and whether they allow it depends on details of the implementation of F(). Generally speaking, different compilers behave differently with respect to this sort of thing.
It is entirely plausible that a compiler would inline function f() into function sum(), which would give it the opportunity to recognize that there are two calls to F(x) contributing to the same result. In that case, if F() has no side effects then it is conceivable that the compiler would emit only a single call to F(), reusing the result.
Particular implementations may have extensions that can be employed to help the compiler come to such a conclusion. Without such an extension being applied to the problem, however, I rate it unlikely that a compiler would emit code that performs just one call to F() and reuses the result.
What you're describing is called memoization, a form of (usually) run-time caching. While it's possible for this to be implemented in compilers, it's most often not performed in C compilers.
C++ does have a clever workaround for this, using the STL, detailed at this blog post from a few years ago; there's also a slightly more recent SO answer here. It's worth noting that with this approach, the compiler isn't "smartly" inferring that a function's multiple identical results will be reused, but the effect is largely the same.
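The idea behind that workaround, as a minimal hand-rolled sketch (F_memo and the std::map cache are illustrative simplifications, not the blog post's exact code):
#include <cmath>
#include <map>

double F(double x); // the expensive function

// Caches F's results by argument. Only valid if F is pure:
// same input always yields the same output, and no side effects.
double F_memo(double x) {
    static std::map<double, double> cache;
    auto it = cache.find(x);
    if (it != cache.end())
        return it->second;  // reuse a previously computed result
    return cache[x] = F(x); // compute once, remember for next time
}

double f(double x)   { return std::sqrt(F_memo(x)); }
double sum(double x) { return F_memo(x) + f(x); } // F runs once per distinct x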
Some languages, like Haskell, do feature baked-in support for compile-time memoization, but the compiler architecture is fairly different from Clang or GCC/G++.
Many compilers use hints to figure out if a result of a previous function call may be reused. A classical example is:
for (int i=0; i < strlen(str); i++)
Without optimization, the complexity of this loop is at least O(n²), but after optimization it can be O(n).
The hints that gcc, clang, and many others can take are __attribute__((pure)) and __attribute__((const)) which are described here. For example, GNU strlen is declared as a pure function.
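Applied to the question's functions, a minimal sketch (GCC/Clang extension; only correct if F truly has no side effects and reads no mutable global state):
#include <cmath>

// __attribute__((const)) promises the result depends only on the arguments,
// allowing the compiler to merge repeated calls with the same argument.
double F(double x) __attribute__((const));

double f(double x) { return std::sqrt(F(x)); }

double sum(double x) {
    return F(x) + f(x); // after inlining f, the two F(x) calls may be CSE'd into one
}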
GCC can detect pure functions and can suggest to the programmer which functions should be marked pure (see -Wsuggest-attribute=pure). In fact, it does that automatically for the following simplistic example:
unsigned my_strlen(const char* str)
{
int i=0;
while (str[i])
++i;
return i;
}
unsigned word_len(const char *str)
{
for (unsigned i=0 ; i < my_strlen(str); ++i) {
if (str[i] == ' ')
return i;
}
return my_strlen(str);
}
You can see the compilation result for gcc with -O3 -fno-inline. It calls my_strlen(str) only once in the whole word_len function. Clang 7.0.0 does not seem to perform this optimization.
word_len:
mov rcx, rdi
call my_strlen ; <-- this is the only call (outside any loop)
test eax, eax
je .L7
xor edx, edx
cmp BYTE PTR [rcx], 32
lea rdi, [rdi+1]
jne .L11
jmp .L19
.L12:
add rdi, 1
cmp BYTE PTR [rdi-1], 32
je .L9
.L11:
add edx, 1
cmp eax, edx
jne .L12
.L7:
ret
.L19:
xor edx, edx
.L9:
mov eax, edx
ret
I'm trying to implement a simple busy loop function.
This should keep polling a std::atomic variable for a maximum number of times (spinCount), and return true if the status did change (to anything other than NOT_AVAILABLE) within the given tries, or false otherwise:
// noinline is just to be able to inspect the resulting ASM a bit easier - in final code, this function SHOULD be inlined!
__declspec(noinline) static bool trySpinWait(std::atomic<Status>* statusPtr, const int spinCount)
{
int iSpinCount = 0;
while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
return iSpinCount == spinCount;
}
However, it seems that MSVC just optimizes the loop away in Release mode for Win64. I'm pretty bad with assembly, but it doesn't look to me like it's ever even trying to read the value of statusPtr at all:
int iSpinCount = 0;
000000013F7E2040 xor eax,eax
while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
000000013F7E2042 inc eax
000000013F7E2044 cmp eax,edx
000000013F7E2046 jge trySpinWait+12h (013F7E2052h)
000000013F7E2048 mov r8d,dword ptr [rcx]
000000013F7E204B test r8d,r8d
000000013F7E204E je trySpinWait+2h (013F7E2042h)
return iSpinCount == spinCount;
000000013F7E2050 cmp eax,edx
000000013F7E2052 sete al
My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this, but it seems that's not the case (or rather, my understanding was probably wrong).
What am I doing wrong here, or rather - how can I best implement that loop without having it optimized away, with least impact on overall performance?
I know I could use #pragma optimize("", off), but (other than in the example above) in my final code I'd very much like to have this call inlined into a larger function for performance reasons. It seems that this #pragma will generally prevent inlining, though.
Appreciate any thoughts!
Thanks
but doesn't look to me like it's ever even trying to read the value of statusPtr at all
It does reload it on every iteration of the loop:
000000013F7E2048 mov r8d,dword ptr [rcx] # rcx is statusPtr
My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this,
You do not need anything more than std::memory_order_relaxed here, because there is only one variable shared between threads (moreover, this code doesn't change the value of the atomic variable). There are no reordering concerns.
In other words, this function works as expected.
You may like to use the PAUSE instruction; see Benefitting Power and Performance Sleep Loops.
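For illustration, a minimal sketch combining both suggestions (the Status enum is a hypothetical stand-in for the question's type; _mm_pause() from <immintrin.h> emits PAUSE on x86). Note this variant returns true exactly when the status changed within the allotted tries, matching the stated intent:
#include <atomic>
#include <immintrin.h> // _mm_pause (x86)

enum class Status { NOT_AVAILABLE, AVAILABLE }; // hypothetical definition

static bool trySpinWait(std::atomic<Status>* statusPtr, const int spinCount)
{
    for (int i = 0; i < spinCount; ++i) {
        if (statusPtr->load(std::memory_order_relaxed) != Status::NOT_AVAILABLE)
            return true; // status changed within the given tries
        _mm_pause();     // cheaper spinning; frees resources for the sibling hyperthread
    }
    return false;        // still NOT_AVAILABLE after spinCount polls
}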