Are function-local static mutexes thread-safe? - c++

In the following program I attempt to make the print function thread-safe by using a function-local mutex object:
#include <iostream>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
void print(const std::string & s)
{
// Thread safe?
static std::mutex mtx;
std::unique_lock<std::mutex> lock(mtx);
std::cout << s << std::endl;
}
int main()
{
std::thread([&](){ for (int i = 0; i < 10; ++i) print("a" + std::to_string(i)); }).detach();
std::thread([&](){ for (int i = 0; i < 10; ++i) print("b" + std::to_string(i)); }).detach();
std::thread([&](){ for (int i = 0; i < 10; ++i) print("c" + std::to_string(i)); }).detach();
std::thread([&](){ for (int i = 0; i < 10; ++i) print("d" + std::to_string(i)); }).detach();
std::thread([&](){ for (int i = 0; i < 10; ++i) print("e" + std::to_string(i)); }).detach();
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
Is this safe?
My doubts arise from this question, which presents a similar case.

C++11
In C++11 and later versions: yes, this pattern is safe. In particular, initialization of function-local static variables is thread-safe, so your code above works safely across threads.
The way this works in practice is that the compiler inserts any necessary boilerplate in the function itself to check if the variable is initialized prior to access. In the case of std::mutex as implemented in gcc, clang and icc, however, the initialized state is all-zeros, so no explicit initialization is needed (the variable will live in the all-zeros .bss section so the initialization is "free"), as we see from the assembly1:
inc(int& i):
mov eax, OFFSET FLAT:_ZL28__gthrw___pthread_key_createPjPFvPvE
test rax, rax
je .L2
push rbx
mov rbx, rdi
mov edi, OFFSET FLAT:_ZZ3incRiE3mtx
call _ZL26__gthrw_pthread_mutex_lockP15pthread_mutex_t
test eax, eax
jne .L10
add DWORD PTR [rbx], 1
mov edi, OFFSET FLAT:_ZZ3incRiE3mtx
pop rbx
jmp _ZL28__gthrw_pthread_mutex_unlockP15pthread_mutex_t
.L2:
add DWORD PTR [rdi], 1
ret
.L10:
mov edi, eax
call _ZSt20__throw_system_errori
Note that starting at the line mov edi, OFFSET FLAT:_ZZ3incRiE3mtx it simply loads the address of the inc::mtx function-local static and calls pthread_mutex_lock on it, without any initialization. The code before that dealing with pthread_key_create is apparently just checking if the pthreads library is present at all.
There's no guarantee, however, that all implementations will implement std::mutex as all-zeros, so you might in some cases incur ongoing overhead on each call to check whether the mutex has been initialized. Declaring the mutex outside the function would avoid that.
Here's an example contrasting the two approaches with a stand-in mutex2 class with a non-inlinable constructor (so the compiler can't determine that the initial state is all-zeros):
#include <mutex>
class mutex2 {
public:
mutex2();
void lock();
void unlock();
};
void inc_local(int &i)
{
// Thread safe?
static mutex2 mtx;
std::unique_lock<mutex2> lock(mtx);
i++;
}
mutex2 g_mtx;
void inc_global(int &i)
{
std::unique_lock<mutex2> lock(g_mtx);
i++;
}
The function-local version compiles (on gcc) to:
inc_local(int& i):
push rbx
movzx eax, BYTE PTR _ZGVZ9inc_localRiE3mtx[rip]
mov rbx, rdi
test al, al
jne .L3
mov edi, OFFSET FLAT:_ZGVZ9inc_localRiE3mtx
call __cxa_guard_acquire
test eax, eax
jne .L12
.L3:
mov edi, OFFSET FLAT:_ZZ9inc_localRiE3mtx
call _ZN6mutex24lockEv
add DWORD PTR [rbx], 1
mov edi, OFFSET FLAT:_ZZ9inc_localRiE3mtx
pop rbx
jmp _ZN6mutex26unlockEv
.L12:
mov edi, OFFSET FLAT:_ZZ9inc_localRiE3mtx
call _ZN6mutex2C1Ev
mov edi, OFFSET FLAT:_ZGVZ9inc_localRiE3mtx
call __cxa_guard_release
jmp .L3
mov rbx, rax
mov edi, OFFSET FLAT:_ZGVZ9inc_localRiE3mtx
call __cxa_guard_abort
mov rdi, rbx
call _Unwind_Resume
Note the large amount of boilerplate dealing with the __cxa_guard_* functions. First, a rip-relative flag byte, _ZGVZ9inc_localRiE3mtx2 is checked and if non-zero, the variable has already been initialized and we are done and fall into the fast-path. No atomic operations are needed because on x86, loads already have the needed acquire semantics.
If this check fails, we go to the slow path, which is essentially a form of double-checked locking: the initial check is not sufficient to determine that the variable needs initialization because two or more threads may be racing here. The __cxa_guard_acquire call does the locking and the second check, and may either fall through to the fast path as well (if another thread concurrently initialized the object), or may jump down to the actual initialization code at .L12.
Finally note that the last 5 instructions in the assembly aren't directly reachable from the function at all, as they are preceded by an unconditional jmp .L3 and nothing jumps to them. They are there to be jumped to by an exception handler should the call to the constructor mutex2() throw an exception at some point.
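In C++ terms, the generated code behaves roughly like the following sketch, reusing the mutex2 class from the example above. This is my paraphrase, not the compiler's literal output: the __cxa_guard_* functions are real Itanium C++ ABI runtime routines, but their parameter types and the guard layout below are simplified assumptions, and the fast-path check is really an acquire load as described above.
#include <new>  // placement new
extern "C" int  __cxa_guard_acquire(void *guard);
extern "C" void __cxa_guard_release(void *guard);
extern "C" void __cxa_guard_abort(void *guard);
// Stand-ins for what the compiler emits for "static mutex2 mtx;" inside inc_local:
alignas(mutex2) static unsigned char mtx_storage[sizeof(mutex2)]; // the object itself (.bss)
static unsigned long long mtx_guard;                              // the guard variable
static mutex2 &lazy_mtx()
{
    // Fast path: the first byte of the guard is non-zero once initialization is done.
    if (*reinterpret_cast<unsigned char *>(&mtx_guard) == 0) {
        if (__cxa_guard_acquire(&mtx_guard)) {      // returned non-zero: we do the init
            try {
                new (mtx_storage) mutex2();         // run the constructor exactly once
                __cxa_guard_release(&mtx_guard);    // publish "initialized"
            } catch (...) {
                __cxa_guard_abort(&mtx_guard);      // undo, so a later call can retry
                throw;
            }
        }
        // else: another thread completed the initialization while we waited.
    }
    return *reinterpret_cast<mutex2 *>(mtx_storage);
}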
Overall, we can say that the runtime cost of the first-access initialization is low to moderate, because the fast path only checks a single byte flag without any expensive instructions (and the remainder of the function usually implies at least two atomic operations for mutex.lock() and mutex.unlock()), but it comes at a significant code size increase.
Compare to the global version, which is identical except that initialization happens during global initialization rather than before first access:
inc_global(int& i):
push rbx
mov rbx, rdi
mov edi, OFFSET FLAT:g_mtx
call _ZN6mutex24lockEv
add DWORD PTR [rbx], 1
mov edi, OFFSET FLAT:g_mtx
pop rbx
jmp _ZN6mutex26unlockEv
The function is less than a third of the size without any initialization boilerplate at all.
Prior to C++11
Prior to C++11, however, this is generally not safe, unless your compiler makes some special guarantees about the way in which static locals are initialized.
Some time ago, while looking at a similar issue, I examined the assembly generated by Visual Studio for this case. The pseudocode for the generated assembly code for your print method looked something like this:
void print(const std::string & s)
{
if (!init_check_print_mtx) {
init_check_print_mtx = true;
mtx.mutex(); // call mutex() ctor for mtx
}
// ... rest of method
}
The init_check_print_mtx is a compiler-generated global variable specific to this method which tracks whether the local static has been initialized. Note that inside the "one time" initialization block guarded by this variable, the variable is set to true before the mutex is initialized.
I thought this was silly since it ensures that other threads racing into this method will skip the initializer and use an uninitialized mtx - versus the alternative of possibly initializing mtx more than once - but in fact doing it this way allows you to avoid the infinite recursion issue that occurs if std::mutex() were to call back into print, and this behavior is in fact mandated by the standard.
Nemo above mentions that this has been fixed (more precisely, re-specified) in C++11 to require a wait for all racing threads, which would make this safe, but you'll need to check your own compiler for compliance. I didn't check if in fact the new spec includes this guarantee, but I wouldn't be at all surprised given that local statics were pretty much useless in multi-threaded environments without this (except perhaps for primitive values which didn't have any check-and-set behavior because they just referred directly to an already initialized location in the .data segment).
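If you can't rely on that guarantee, the usual workaround is to move the mutex out of the function so it gets constant/zero initialization before main(), before any threads exist, and there is no first-call race at all. A sketch (written with std::mutex as in the question; on a genuinely pre-C++11 toolchain you would use boost::mutex or a pthread_mutex_t initialized with PTHREAD_MUTEX_INITIALIZER instead):
#include <iostream>
#include <mutex>
#include <string>
namespace {
std::mutex print_mtx;   // initialized before main(), before any threads are started
}
void print(const std::string &s)
{
    std::unique_lock<std::mutex> lock(print_mtx);
    std::cout << s << std::endl;
}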
1 Note that I changed the print() function to a slightly simpler inc() function that just increments an integer in the locked region. This has the same locking structure and implications as the original, but avoids a bunch of code dealing with the << operators and std::cout.
2 Using c++filt this de-mangles to guard variable for inc_local(int&)::mtx.

This is not the same as the linked question for several reasons.
The linked question is not C++11, but yours is. In C++11 initialization of function-local static variables is always safe. Prior to C++11 it was only safe with some compilers; e.g. GCC and Clang default to thread-safe initialization.
The linked question initializes the reference by calling a function, which is dynamic initialization and happens at run-time. The default constructor for std::mutex is constexpr so your static variable has constant initialization, i.e. the mutex can be initialized at compile-time (or link-time) so there is nothing to do dynamically at runtime. Even if multiple threads call the function concurrently there's nothing they actually need to do before using the mutex.
Your code is safe (assuming your compiler implements the C++11 rules correctly.)
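As an aside (not part of this answer as originally written): if you want the compiler to verify that the mutex really gets constant initialization, and therefore no hidden first-use guard, C++20's constinit expresses exactly that:
#include <iostream>
#include <mutex>
#include <string>
void print(const std::string &s)
{
    // C++20: ill-formed if std::mutex could not be constant-initialized, so a
    // successful compile proves there is no dynamic initialization to race on.
    static constinit std::mutex mtx;
    std::scoped_lock lock(mtx);
    std::cout << s << '\n';
}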

As long as the mutex is static, yes.
Local, non-static would definitely NOT be safe. Unless all your threads use the same stack, which also means you've now invented memory where one cell can hold many different values at the same time, and are just waiting for the Nobel committee to notify you for the next Nobel prize.
You must have some sort of "global" (shared) memory space for mutexes.
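To make the failure mode concrete, here is a sketch of the non-static variant this answer warns about (illustrative only, not from the original answer):
#include <iostream>
#include <mutex>
#include <string>
void print_broken(const std::string &s)
{
    std::mutex mtx;                          // a fresh mutex on every call, in every thread
    std::unique_lock<std::mutex> lock(mtx);  // each caller locks its own private mutex,
    std::cout << s << std::endl;             // so the output is not serialized at all
}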

Related

When is a static class variable defined in a function initialised [duplicate]

I'm curious about the underlying implementation of static variables within a function.
If I declare a static variable of a fundamental type (char, int, double, etc.), and give it an initial value, I imagine that the compiler simply sets the value of that variable at the very beginning of the program before main() is called:
void SomeFunction();
int main(int argCount, char ** argList)
{
// at this point, the memory reserved for 'answer'
// already contains the value of 42
SomeFunction();
}
void SomeFunction()
{
static int answer = 42;
}
However, if the static variable is an instance of a class:
class MyClass
{
//...
};
void SomeFunction();
int main(int argCount, char ** argList)
{
SomeFunction();
}
void SomeFunction()
{
static MyClass myVar;
}
I know that it will not be initialized until the first time that the function is called. Since the compiler has no way of knowing when the function will be called for the first time, how does it produce this behavior? Does it essentially introduce an if-block into the function body?
static bool initialized = 0;
if (!initialized)
{
// construct myVar
initialized = 1;
}
This question covered similar ground, but thread safety wasn't mentioned. For what it's worth, C++0x will make function static initialisation thread safe.
(see the C++0x FCD, 6.7/4 on function statics: "If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.")
One other thing that hasn't been mentioned is that function statics are destructed in reverse order of their construction, so the compiler maintains a list of destructors to call on shutdown (this may or may not be the same list that atexit uses).
In the compiler output I have seen, function local static variables are initialized exactly as you imagine.
(Caveat: This paragraph applies to C++ versions older than C++11. See the comments for changes since C++11.) Note that in general this is not done in a thread-safe manner. So if you have functions with static locals like that that might be called from multiple threads, you should take this into account. Calling the function once in the main thread before any others are called will usually do the trick.
I should add that if the initialization of the local static is by a simple constant like in your example, the compiler doesn't need to go through these gyrations - it can just initialize the variable in the image or before main() like a regular static initialization (because your program wouldn't be able to tell the difference). But if you initialize it with a function's return value, then the compiler pretty much has to test a flag indicating if the initialization has been done or something equivalent.
You're right about everything, including the initialized flag as a common implementation. This is basically why initialization of static locals is not thread-safe, and why pthread_once exists.
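A sketch of the pthread_once pattern mentioned above (POSIX only; MyClass is the class from the question, and the instance/function names here are just illustrative):
#include <pthread.h>
class MyClass { /* ... */ };
static pthread_once_t my_once = PTHREAD_ONCE_INIT;
static MyClass *my_instance;                 // never destroyed in this sketch
static void init_my_instance() { my_instance = new MyClass(); }
void SomeFunction()
{
    // Exactly one thread runs init_my_instance(); any other thread calling
    // pthread_once concurrently blocks until that initialization completes.
    pthread_once(&my_once, init_my_instance);
    // my_instance is now safe to use
}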
One slight caveat: the compiler must emit code which "behaves as if" the static local variable is constructed the first time it is used. Since integer initialization has no side effects (and calls no user code), it's up to the compiler when it initializes the int. User code cannot "legitimately" find out what it does.
Obviously you can look at the assembly code, or provoke undefined behaviour and make deductions from what actually happens. But the C++ standard doesn't count that as valid grounds to claim that the behaviour is not "as if" it did what the spec says.
I know that it will not be initialized until the first time that the function is called. Since the compiler has no way of knowing when the function will be called for the first time, how does it produce this behavior? Does it essentially introduce an if-block into the function body?
Yes, that's right: and, FWIW, it's not necessarily thread-safe (if the function is called "for the first time" by two threads simultaneously).
For that reason you might prefer to define the variable at global scope (although maybe in a class or namespace, or static without external linkage) instead of inside a function, so that it's initialized before the program starts without any run-time "if".
Another twist is in embedded code, where the run-before-main() code (cinit/whatever) may copy pre-initialized data (both statics and non-statics) into ram from a const data segment, perhaps residing in ROM. This is useful where the code may not be running from some sort of backing store (disk) where it can be re-loaded from. Again, this doesn't violate the requirements of the language, since this is done before main().
Slight tangent: While I've not seen it done much (outside of Emacs), a program or compiler could basically run your code in a process and instantiate/initialize objects, then freeze and dump the process. Emacs does something similar to this to load up large amounts of elisp (i.e. chew on it), then dump the running state as the working executable, to avoid the cost of parsing on each invocation.
The relevant thing isn't being a class type or not, it's compile-time evaluation of the initializer (at the current optimization level). And of course the constructor not having any side-effects, if it's non-trivial.
If it's not possible to simply put a constant value in .data, gcc/clang use an acquire load of a guard variable to check that static locals have been initialized. If the guard variable is false, then they pick one thread to do the initializing, and have other threads wait for it if they also see a false guard variable. They've been doing this for a long time, since before C++11 required it. (e.g. as old as GCC4.1 on Godbolt, from May 2006.)
Does a function local static variable automatically incur a branch? shows what GCC does.
Cost of thread-safe local static variable initialization in C++11? same
Why does initialization of local static objects use hidden guard flags? same
The most simple artificial example, snapshotting the arg from the first call and ignoring later args:
int foo(int a){
static int x = a;
return x;
}
Compiles for x86-64 with GCC11.3 -O3 (Godbolt), with the exact same asm generated for -std=gnu++03 mode. GCC4.1 also makes about the same asm, but doesn't keep the push/pop off the fast path (i.e. missing shrink-wrap optimization). GCC4.1 only supported AT&T syntax output, so it visually looks different unless you flip modern GCC to AT&T mode as well, but this is Intel syntax (destination on the left).
# demangled asm from g++ -O3
foo(int):
movzx eax, BYTE PTR guard variable for foo(int)::x[rip] # guard.load(acquire)
test al, al
je .L13
mov eax, DWORD PTR foo(int)::x[rip] # normal load of the static local
ret # fast path through the function is the already-initialized case
.L13: # jumps here on guard == 0, on the first call (and any that race with it)
# It would be sensible for GCC to put this code in .text.cold
push rbx
mov ebx, edi # save function arg in a call-preserved reg
mov edi, OFFSET FLAT:guard variable for foo(int)::x # address
call __cxa_guard_acquire # guard_acquire(&guard_x) presumably a normal mutex or spinlock
test eax, eax
jne .L14 # if (we won the race to do the init work) goto .L14
mov eax, DWORD PTR foo(int)::x[rip] # else it's done now by another thread
pop rbx
ret
.L14:
mov edi, OFFSET FLAT:guard variable for foo(int)::x
mov DWORD PTR foo(int)::x[rip], ebx # init static x (from a saved in RBX)
call __cxa_guard_release
mov eax, DWORD PTR foo(int)::x[rip] # missed optimization: mov eax, ebx
# This thread is the one that just initialized it, our function arg is the value.
# It's not atomic (or volatile), so another thread can't have set it, too.
pop rbx
ret
If compiling for AArch64, the load of the guard variable is ldarb w8, [x8], a load with acquire semantics. Other ISAs might need a plain load and then a barrier to give at least LoadLoad ordering, to make sure they load the payload x no earlier than when they saw the guard variable being non-zero.
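Expressed in portable C++ terms, the fast path above does roughly the following. This is a sketch of the ordering requirement only; the real guard variable and its layout are ABI details, and the names here are made up:
#include <atomic>
std::atomic<unsigned char> guard_for_x{0};   // stand-in for "guard variable for foo(int)::x"
int x_payload;                               // stand-in for foo(int)::x itself
int foo_fast_path()
{
    if (guard_for_x.load(std::memory_order_acquire) != 0)  // like the movzx / ldarb above
        return x_payload;   // acquire keeps this load from being reordered before the guard load
    return -1;              // otherwise: take the __cxa_guard_acquire slow path (omitted here)
}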
If the static variable has a constant initializer, no guard is needed
int bar(int a){
static int x = 1;
return ++x + a;
}
bar(int):
mov eax, DWORD PTR bar(int)::x[rip]
add eax, 1
mov DWORD PTR bar(int)::x[rip], eax # store the updated value
add eax, edi # and add it to the function arg
ret
.section .data
bar(int)::x:
.long 1

Coroutine frame seems to be prematurely marked as destroyed when using address sanitizer

I am trying to write a small and simple coroutine library just to get a more solid understanding of C++20 coroutines. It seems to work fine, but when I compile with clang's address sanitizer, it throws up on me.
I have narrowed down the issue to the following code example (available with compiler and sanitizer output at https://godbolt.org/z/WqY6Gd), but I still can't make any sense of it.
// namespace coro = std::/std::experimental;
// inlining this suppresses the error
__attribute__((noinline)) void foo(int& i) { i = 0; }
struct task {
struct promise_type {
promise_type() = default;
coro::suspend_always initial_suspend() noexcept { return {}; }
coro::suspend_always final_suspend() noexcept { return {}; }
void unhandled_exception() noexcept { std::terminate(); }
void return_value(int v) noexcept { value = v; }
task get_return_object() {
return task{coro::coroutine_handle<promise_type>::from_promise(*this)};
}
int value{};
};
void Start() { return handle_.resume(); }
int Get() {
auto& promise = handle_.promise();
return promise.value;
}
coro::coroutine_handle<promise_type> handle_;
};
task func() { co_return 3; }
int main() {
auto t = func();
t.Start();
const auto result = t.Get();
foo(t.handle_.promise().value);
// moving this one line down or separating this into a noinline
// function suppresses the error
// removing this removes the stack-use-after-scope, but (rightfully) reports a leak
t.handle_.destroy();
if (result != 3) return 1;
}
Address sanitizer reports use-after-scope (full output available at godbolt, link above).
With some help from lldb, I found out that the error is thrown in main, more precisely: the jump at line 112 in the assembly listing, jne .LBB2_15, jumps to asan's report and never returns. It seems to be inside main's prologue.
As the comments indicate, moving destroy() a line down or calling it in a separate noinline function1 changes the behavior of address sanitizer. The only two explanations to this seem to be undefined behavior and asan throwing a false positive (or -fsanitize=address itself is creating lifetime issues, which is sort of the same in a sense).
At this point I'm fairly certain that there's no UB in the code above: both task and result live on main's stack frame, the promise object lives in the coroutine frame. The frame itself is allocated (on main's stack because no suspend-points) at line 1 of main, and destroyed right before returning, past the last access to it in foo(). The coroutine frame is not destroyed automatically because control never flows off co_await final_suspend(), as per the standard. I've been staring at this code for a while, though, so please forgive me if I missed something obvious.
The assembly generated without sanitation seems to make sense to me and all the memory access happens within [rsp, rsp+24], as allocated. Furthermore, compiling with -fsanitize=address,undefined, or just -fsanitize=undefined, or simply compiling with gcc with -fsanitize=address reports no errors, which leads me to believe the issue is hidden somewhere in the code generated by asan.
Unfortunately, I can't quite make sense of what exactly happens in the code instrumented by asan, and that's why I'm posting this. I have a general understanding of Address sanitizer's algorithm, but I can't map the assembly memory accesses/allocations to what's happening in the C++ code.
I'm hoping that an answer will help me:
Understand where the lifetime issues are hidden, if there are any
Understand what exactly happens in main when compiled with asan, so that a person reading this can have a more clear way of finding what memory access in the C++ code triggered the error, and where (if anywhere) was that memory allocated and freed.
Consistently suppress this particular false positive, and elaborate a bit on what causes it, if the issue really is in asan.
Thanks in advance.
1 This initially led me to believe that clang's optimizer is reading result from the (destroyed) coroutine's frame directly, but moving destroy() into task's destructor brings the issue back and proves that theory wrong, as far as I can tell. destroy() is not in the destructor in the listing above because that requires implementing move construction/assignment in order to avoid a double free, and I wanted to keep the example as small and clear as possible.
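(As an aside, not from the question: the RAII/move variant that footnote describes would look roughly like the sketch below, so that only one task object ever owns, and eventually destroys, the handle. promise_type is exactly as in the question, and coro:: is the same namespace alias.)
#include <utility>   // std::exchange
struct task {
    struct promise_type { /* exactly as in the question */ };
    explicit task(coro::coroutine_handle<promise_type> h) : handle_(h) {}
    task(const task &) = delete;
    task &operator=(const task &) = delete;
    task(task &&other) noexcept : handle_(std::exchange(other.handle_, {})) {}
    task &operator=(task &&other) noexcept {
        if (this != &other) {
            if (handle_) handle_.destroy();
            handle_ = std::exchange(other.handle_, {});
        }
        return *this;
    }
    ~task() { if (handle_) handle_.destroy(); }   // destroy exactly once, when the owner dies
    coro::coroutine_handle<promise_type> handle_;
};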
I think I figured it out - but mostly because it's already fixed in clang12.0.
Running the smaller/cleaner example with clang-12 shows no error from asan. The difference is in the following lines:
movabs rcx, -866669180174077455
mov qword ptr [r13 + 2147450880], rcx
mov dword ptr [r13 + 2147450888], -202116109
lea rdi, [r12 + 40]
mov rcx, rdi
shr rcx, 3
cmp byte ptr [rcx + 2147450880], 0
jne .LBB2_14
lea r14, [r12 + 32]
mov qword ptr [r14 + 8], offset f() [clone .cleanup]
lea rdi, [r14 + 16]
mov byte ptr [r13 + 2147450884], 0
mov rcx, rdi
shr rcx, 3
mov dl, byte ptr [rcx + 2147450880]
test dl, dl
jne .LBB2_7
.LBB2_8:
mov qword ptr [rbx + 16], rax # 8-byte Spill
mov dword ptr [r14 + 16], 0
Which clang-11 has, and clang-12 doesn't. From the looks of it, the address sanitizer tries to check that r12+40 (which should be the promise's cleanup method) is initialized before initializing it.
Clang-12 just performs no checks for the promise, leaving the entirety of the code above out.
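One detail that helps decode listings like the one above (my addition, not from the original answer): on x86-64 Linux, ASan maps every 8 bytes of application memory to one shadow byte.
// shadow_byte_address = (address >> 3) + 0x7FFF8000    (0x7FFF8000 == 2147450880)
// So "shr rcx, 3" followed by "cmp byte ptr [rcx + 2147450880], 0" reads the shadow byte
// for the address in rcx; a zero shadow byte means the whole 8-byte granule is addressable,
// while non-zero values encode partially addressable or poisoned (e.g. out-of-scope) memory.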
TL;DR: (probably) a bug in clang-11 coroutine sanitation, fixed in 12.0, perhaps in later versions of clang-11 as well.

Multithreading program stuck in optimized mode but runs normally in -O0

I wrote a simple multithreading program as follows:
static bool finished = false;
int func()
{
size_t i = 0;
while (!finished)
++i;
return i;
}
int main()
{
auto result=std::async(std::launch::async, func);
std::this_thread::sleep_for(std::chrono::seconds(1));
finished=true;
std::cout<<"result ="<<result.get();
std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}
It behaves normally in debug mode in Visual Studio, or with -O0 in gcc, and prints out the result after 1 second. But it gets stuck and does not print anything in Release mode, or with -O1, -O2 or -O3.
Two threads accessing a non-atomic, non-guarded variable is UB. This concerns finished. You could make finished of type std::atomic<bool> to fix this.
My fix:
#include <iostream>
#include <future>
#include <atomic>
static std::atomic<bool> finished = false;
int func()
{
size_t i = 0;
while (!finished)
++i;
return i;
}
int main()
{
auto result=std::async(std::launch::async, func);
std::this_thread::sleep_for(std::chrono::seconds(1));
finished=true;
std::cout<<"result ="<<result.get();
std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}
Output:
result =1023045342
main thread id=140147660588864
Live Demo on coliru
Somebody may think 'It's a bool – probably one bit. How can this be non-atomic?' (I did when I started with multi-threading myself.)
But note that lack-of-tearing is not the only thing that std::atomic gives you. It also makes concurrent read+write access from multiple threads well-defined, stopping the compiler from assuming that re-reading the variable will always see the same value.
Using an unguarded, non-atomic bool can cause additional issues:
The compiler might decide to optimize the variable into a register or even CSE multiple accesses into one and hoist a load out of a loop.
The variable might be cached for a CPU core. (In real life, CPUs have coherent caches. This is not a real problem, but the C++ standard is loose enough to cover hypothetical C++ implementations on non-coherent shared memory where atomic<bool> with memory_order_relaxed store/load would work, but where volatile wouldn't. Using volatile for this would be UB, even though it works in practice on real C++ implementations.)
To prevent this from happening, the compiler must be explicitly told not to do so.
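As the parenthetical above hints, a pure stop-flag like finished does not even need the default sequentially consistent ordering; here is a sketch of that relaxed variation of the fix (my own, assuming no other data is published through the flag; the value of i still travels safely through the future):
#include <atomic>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
static std::atomic<bool> finished{false};
int func()
{
    size_t i = 0;
    while (!finished.load(std::memory_order_relaxed))  // still re-read on every iteration
        ++i;
    return i;
}
int main()
{
    auto result = std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished.store(true, std::memory_order_relaxed);   // only stops the loop;
    std::cout << "result =" << result.get() << '\n';   // result.get() itself synchronizes
}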
I'm a little bit surprised about the evolving discussion concerning the potential relation of volatile to this issue. Thus, I'd like to spend my two cents:
Is volatile useful with threads
Who's afraid of a big bad optimizing compiler?.
Scheff's answer describes how to fix your code. I thought I would add a little information on what is actually happening in this case.
I compiled your code at godbolt using optimisation level 1 (-O1). Your function compiles like so:
func():
cmp BYTE PTR finished[rip], 0
jne .L4
.L5:
jmp .L5
.L4:
mov eax, 0
ret
So, what is happening here?
First, we have a comparison: cmp BYTE PTR finished[rip], 0 - this checks to see if finished is false or not.
If it is not false (aka true) we should exit the loop on the first run. This is accomplished by jne .L4, which jumps when not equal to label .L4, where the value of i (0) is stored in a register for later use and the function returns.
If it is false however, we move to
.L5:
jmp .L5
This is an unconditional jump, to label .L5 which just so happens to be the jump command itself.
In other words, the thread is put into an infinite busy loop.
So why has this happened?
As far as the optimiser is concerned, threads are outside of its purview. It assumes other threads aren't reading or writing variables simultaneously (because that would be data-race UB). You need to tell it that it cannot optimise accesses away. This is where Scheff's answer comes in. I won't bother to repeat him.
Because the optimiser is not told that the finished variable may potentially change during execution of the function, it sees that finished is not modified by the function itself and assumes that it is constant.
The optimised code provides the two code paths that will result from entering the function with a constant bool value; either it runs the loop infinitely, or the loop is never run.
At -O0 the compiler (as expected) does not optimise the loop body and comparison away:
func():
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], 0
.L148:
movzx eax, BYTE PTR finished[rip]
test al, al
jne .L147
add QWORD PTR [rbp-8], 1
jmp .L148
.L147:
mov rax, QWORD PTR [rbp-8]
pop rbp
ret
Therefore the function, when unoptimised, does work; the lack of atomicity here is typically not a problem, because the code and data type are simple. Probably the worst we could run into here is a value of i that is off by one from what it should be.
A more complex system with data-structures is far more likely to result in corrupted data, or improper execution.
For the sake of completeness in the learning curve: you should avoid using global variables. You did a good job, though, by making it static, so it will be local to the translation unit.
Here is an example:
class ST {
public:
int func()
{
size_t i = 0;
while (!finished)
++i;
return i;
}
void setFinished(bool val)
{
finished = val;
}
private:
std::atomic<bool> finished = false;
};
int main()
{
ST st;
auto result=std::async(std::launch::async, &ST::func, std::ref(st));
std::this_thread::sleep_for(std::chrono::seconds(1));
st.setFinished(true);
std::cout<<"result ="<<result.get();
std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}
Live on wandbox

C++ variable declaration re: memory usage

Hi, I know this is a very silly/basic question, but what is the difference between code like this:
int *i;
for(j=0;j<10;j++)
{
i = static_cast<int *>(getNthCon(j));
i->xyz;
}
and, some thing like this :-
for(j=0;j<10;j++)
{
int *i = static_cast<int *>(getNthCon(j));
i->xyz;
}
I mean, is this code exactly the same in logic, or would there be any difference due to its local nature?
One practical difference is the scope of i. In the first case, i continues to exist after the final iteration of the loop. In the second it does not.
There may be some case where you want to know the value of i after all of the computation is complete. In that case, use the second pattern.
A less practical difference is the nature of the = token in each case. In the first example i = ... indicates assignment. In the second example, int *i = ... indicates initialization. Some types (but not int* nor fp_ContainerObject*) might treat assignment and initialization differently.
There is very little difference between them.
In the first code sample, i is declared outside the loop, so you're re-using the same pointer variable on each iteration. In the second, i is local to the body of the loop.
Since i is not used outside the loop, and the value assigned to it in one iteration is not used in future iterations, it's better style to declare it locally, as in the second sample.
Incidentally, i is a bad name for a pointer variable; it's usually used for int variables, particularly ones used in for loops.
For any sane optimizing compiler there will be no difference in terms of memory allocation. The only difference will be the scope of i. Here is a sample program (and yes, I realize there is a leak here):
#include <iostream>
int *get_some_data(int value) {
return new int(value);
}
int main(int argc, char *argv[]){
int *p;
for(int i = 0; i < 10; ++i) {
p = get_some_data(i);
std::cout << *p;
}
return 0;
}
And the generated assembly output:
int main(int argc, char *argv[]){
01091000 push esi
01091001 push edi
int *p;
for(int i = 0; i < 10; ++i) {
01091002 mov edi,dword ptr [__imp_operator new (10920A8h)]
01091008 xor esi,esi
0109100A lea ebx,[ebx]
p = get_some_data(i);
01091010 push 4
01091012 call edi
01091014 add esp,4
01091017 test eax,eax
01091019 je main+1Fh (109101Fh)
0109101B mov dword ptr [eax],esi
0109101D jmp main+21h (1091021h)
0109101F xor eax,eax
std::cout << *p;
01091021 mov eax,dword ptr [eax]
01091023 mov ecx,dword ptr [__imp_std::cout (1092048h)]
01091029 push eax
0109102A call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (1092044h)]
01091030 inc esi
01091031 cmp esi,0Ah
01091034 jl main+10h (1091010h)
}
Now with the pointer declared inside of the loop:
int main(int argc, char *argv[]){
008D1000 push esi
008D1001 push edi
for(int i = 0; i < 10; ++i) {
008D1002 mov edi,dword ptr [__imp_operator new (8D20A8h)]
008D1008 xor esi,esi
008D100A lea ebx,[ebx]
int *p = get_some_data(i);
008D1010 push 4
008D1012 call edi
008D1014 add esp,4
008D1017 test eax,eax
008D1019 je main+1Fh (8D101Fh)
008D101B mov dword ptr [eax],esi
008D101D jmp main+21h (8D1021h)
008D101F xor eax,eax
std::cout << *p;
008D1021 mov eax,dword ptr [eax]
008D1023 mov ecx,dword ptr [__imp_std::cout (8D2048h)]
008D1029 push eax
008D102A call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (8D2044h)]
008D1030 inc esi
008D1031 cmp esi,0Ah
008D1034 jl main+10h (8D1010h)
}
As you can see, the output is identical. Note that, even in a debug build, the assembly remains identical.
Ed S. shows that most compilers will generate the same code for both cases. But, as Mahesh points out, they're not actually identical (even beyond the obvious fact that it would be legal to use i outside the loop scope in version 1 but not version 2). Let me try to explain how these can both be true, in a way that isn't misleading.
First, where does the storage for i come from?
The standard is silent on this—as long as storage is available for the entire lifetime of the scope of i, it can be anywhere the compiler likes. But the typical way to deal with local variables (technically, variables with automatic storage duration) is to expand the stack frame of the appropriate scope by sizeof(i) bytes, and store i as an offset into that stack frame.
A "teaching compiler" might always create a stack frame for each scope. But a real compiler usually won't bother, especially if nothing happens on entering and exiting the loop scope. (There's no way you can tell the difference, except by looking at the assembly or breaking in with a debugger, so of course it's allowed to do this.) So, both versions will probably end up with i referring to the exact same offset from the function's stack frame. (Actually, it's quite plausible i will end up in a register, but that doesn't change anything important here.)
Now let's look at the lifecycle.
In the first case, the compiler has to default-initialize i where it's declared at the function scope, copy-assign into it each time through the loop, and destroy it at the end of the function scope. In the second case, the compiler has to copy-initialize i at the start of each loop, and destroy it at the end of each loop.
If i were of class type, this would be a very significant difference. (See below if it's not obvious why.) But it's not, it's a pointer. This means default initialization and destruction are both no-ops, and copy-initialization and copy-assignment are identical.
So, the lifecycle-management code will be identical in both cases: it's a copy once each time through the loop, and that's it.
In other words, the storage is allowed to be, and probably will be, the same; the lifecycle management is required to be the same.
I promised to come back to why these would be different if i were of class type.
Compare this pseudocode:
i.IType();
for(j=0;j<10;j++) {
i.operator=(static_cast<IType>(getNthCon(j)));
}
i.~IType();
to this:
for(j=0;j<10;j++) {
i.IType(static_cast<IType>(getNthCon(j)));
i.~IType();
}
At first glance, the first version looks "better", because it's 1 IType(), 10 operator=(IType&), and 1 ~IType(), while the second is 10 IType(IType&) and 10 ~IType(). And for some classes, this might be true. But if you think about how operator= works, it usually has to do at least the equivalent of a copy construction and a destruction.
So the real difference here is that the first version requires a default constructor and a copy-assignment operator, while the second doesn't. And if you take out that static_cast bit (so we're talking about a conversion constructor and assignment instead of copy), what you're looking at is equivalent to this:
for(j=0;j<10;j++) {
std::ifstream i(filenames[j]);
}
Clearly, you would try to pull i out of the loop in that case.
But again, this is only true for "most" classes; you can easily design a class for which version 2 is ridiculously bad and version 1 makes more sense.
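One illustration (mine, not from the original answer) of a type where hoisting the variable out of the loop pays off: a std::string declared outside the loop can keep reusing its heap buffer, while one declared inside may allocate on every iteration:
#include <cstddef>
#include <string>
#include <vector>
std::size_t total_length(const std::vector<int> &ids)
{
    std::string line;                      // "version 1" style: constructed once, capacity reused
    std::size_t total = 0;
    for (int id : ids) {
        line.clear();                      // keeps the previously allocated buffer
        line += "record #";
        line += std::to_string(id);
        total += line.size();
    }
    return total;
}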
For every iteration, in the second case, a new pointer variable is created on the stack, while in the first case the pointer variable is created only once (i.e., before entering the loop).
