Multithreading program stuck in optimized mode but runs normally in -O0 - c++

I wrote a simple multithreaded program as follows:
#include <iostream>
#include <future>
#include <thread>
#include <chrono>

static bool finished = false;

int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result = std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished = true;
    std::cout << "result =" << result.get();
    std::cout << "\nmain thread id=" << std::this_thread::get_id() << std::endl;
}
It behaves normally in debug mode in Visual Studio, or with -O0 in gcc, and prints the result after 1 second. But it gets stuck and prints nothing in Release mode, or with -O1, -O2, or -O3.

Two threads accessing a non-atomic, non-guarded variable is undefined behavior (UB). This concerns finished. You could make finished of type std::atomic<bool> to fix this.
My fix:
#include <iostream>
#include <future>
#include <thread>
#include <chrono>
#include <atomic>

static std::atomic<bool> finished = false;

int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result = std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished = true;
    std::cout << "result =" << result.get();
    std::cout << "\nmain thread id=" << std::this_thread::get_id() << std::endl;
}
Output:
result =1023045342
main thread id=140147660588864
Live Demo on coliru
Somebody may think 'It's a bool – probably one bit. How can this be non-atomic?' (I did when I started with multi-threading myself.)
But note that lack-of-tearing is not the only thing that std::atomic gives you. It also makes concurrent read+write access from multiple threads well-defined, stopping the compiler from assuming that re-reading the variable will always see the same value.
Using an unguarded, non-atomic bool can cause additional issues:
The compiler might decide to optimize the variable into a register, or even CSE multiple accesses into one and hoist the load out of the loop.
The variable might be cached per CPU core. (In real life, CPUs have coherent caches, so this is not a real problem, but the C++ standard is loose enough to cover hypothetical C++ implementations on non-coherent shared memory where atomic<bool> with memory_order_relaxed store/load would work, but where volatile wouldn't. Using volatile for this would be UB, even though it works in practice on real C++ implementations.)
To prevent this from happening, the compiler must be told explicitly not to do so; a minimal sketch follows.
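For a pure stop-flag like this one, where no other data is published through the flag, even relaxed ordering suffices (a minimal sketch of my own, not from the original answer; on x86 the relaxed load compiles to a plain MOV):
#include <atomic>

static std::atomic<bool> finished{false};

int func()
{
    size_t i = 0;
    // The explicit atomic load forces the compiler to re-read the flag
    // on every iteration instead of hoisting it out of the loop.
    while (!finished.load(std::memory_order_relaxed))
        ++i;
    return i;
}

// main() would then signal the worker with:
//     finished.store(true, std::memory_order_relaxed);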
I'm a little bit surprised about the evolving discussion concerning the potential relation of volatile to this issue. Thus, I'd like to spend my two cents:
Is volatile useful with threads
Who's afraid of a big bad optimizing compiler?.

Scheff's answer describes how to fix your code. I thought I would add a little information on what is actually happening in this case.
I compiled your code at godbolt using optimisation level 1 (-O1). Your function compiles like so:
func():
cmp BYTE PTR finished[rip], 0
jne .L4
.L5:
jmp .L5
.L4:
mov eax, 0
ret
So, what is happening here?
First, we have a comparison: cmp BYTE PTR finished[rip], 0 - this checks to see if finished is false or not.
If it is not false (i.e. true), we should exit the loop on the first run. This is accomplished by jne .L4, which jumps when not equal to label .L4, where the value of i (0) is stored in a register for later use and the function returns.
If it is false, however, we move to
.L5:
jmp .L5
This is an unconditional jump, to label .L5 which just so happens to be the jump command itself.
In other words, the thread is put into an infinite busy loop.
So why has this happened?
As far as the optimiser is concerned, threads are outside of its purview. It assumes other threads aren't reading or writing variables simultaneously (because that would be data-race UB). You need to tell it that it cannot optimise accesses away. This is where Scheff's answer comes in. I won't bother to repeat him.
Because the optimiser is not told that the finished variable may potentially change during execution of the function, it sees that finished is not modified by the function itself and assumes that it is constant.
The optimised code provides the two code paths that will result from entering the function with a constant bool value; either it runs the loop infinitely, or the loop is never run.
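In C++ terms, this is roughly what the optimiser has rewritten func() into (my paraphrase of the assembly above, not actual compiler output):
// finished is assumed constant, so it is read once at function entry.
int func()
{
    if (finished)      // the single cmp/jne at the top
        return 0;      // .L4: mov eax, 0; ret
    while (true) { }   // .L5: jmp .L5 - the infinite busy loop
}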
At -O0 the compiler (as expected) does not optimise the loop body and comparison away:
func():
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], 0
.L148:
movzx eax, BYTE PTR finished[rip]
test al, al
jne .L147
add QWORD PTR [rbp-8], 1
jmp .L148
.L147:
mov rax, QWORD PTR [rbp-8]
pop rbp
ret
Therefore the function, when unoptimised, does work. The lack of atomicity here is typically not a problem, because the code and data type are simple. Probably the worst we could run into here is a value of i that is off by one from what it should be.
A more complex system with data-structures is far more likely to result in corrupted data, or improper execution.

For the sake of completeness in the learning curve: you should avoid using global variables. You did a good job though by making it static, so it will be local to the translation unit.
Here is an example:
#include <iostream>
#include <future>
#include <thread>
#include <chrono>
#include <atomic>

class ST {
public:
    int func()
    {
        size_t i = 0;
        while (!finished)
            ++i;
        return i;
    }
    void setFinished(bool val)
    {
        finished = val;
    }
private:
    std::atomic<bool> finished = false;
};

int main()
{
    ST st;
    auto result = std::async(std::launch::async, &ST::func, std::ref(st));
    std::this_thread::sleep_for(std::chrono::seconds(1));
    st.setFinished(true);
    std::cout << "result =" << result.get();
    std::cout << "\nmain thread id=" << std::this_thread::get_id() << std::endl;
}
Live on wandbox

Related

Coroutine frame seems to be prematurely marked as destroyed when using address sanitizer

I am trying to write a small and simple coroutine library just to get a more solid understanding of C++20 coroutines. It seems to work fine, but when I compile with clang's address sanitizer, it throws up on me.
I have narrowed down the issue to the following code example (available with compiler and sanitizer output at https://godbolt.org/z/WqY6Gd), but I still can't make any sense of it.
// namespace coro = std::/std::experimental;

// inlining this suppresses the error
__attribute__((noinline)) void foo(int& i) { i = 0; }

struct task {
    struct promise_type {
        promise_type() = default;
        coro::suspend_always initial_suspend() noexcept { return {}; }
        coro::suspend_always final_suspend() noexcept { return {}; }
        void unhandled_exception() noexcept { std::terminate(); }
        void return_value(int v) noexcept { value = v; }
        task get_return_object() {
            return task{coro::coroutine_handle<promise_type>::from_promise(*this)};
        }
        int value{};
    };

    void Start() { return handle_.resume(); }
    int Get() {
        auto& promise = handle_.promise();
        return promise.value;
    }

    coro::coroutine_handle<promise_type> handle_;
};

task func() { co_return 3; }

int main() {
    auto t = func();
    t.Start();
    const auto result = t.Get();
    foo(t.handle_.promise().value);
    // moving this one line down or separating this into a noinline
    // function suppresses the error
    // removing this removes the stack-use-after-scope, but (rightfully) reports a leak
    t.handle_.destroy();
    if (result != 3) return 1;
}
Address sanitizer reports use-after-scope (full output available at godbolt, link above).
With some help from lldb, I found out that the error is thrown in main, more precisely: the jump at line 112 in the assembly listing, jne .LBB2_15, jumps to asan's report and never returns. It seems to be inside main's prologue.
As the comments indicate, moving destroy() a line down or calling it in a separate noinline function1 changes the behavior of address sanitizer. The only two explanations for this seem to be undefined behavior or asan throwing a false positive (or -fsanitize=address itself creating lifetime issues, which is sort of the same in a sense).
At this point I'm fairly certain that there's no UB in the code above: both task and result live on main's stack frame, the promise object lives in the coroutine frame. The frame itself is allocated (on main's stack because no suspend-points) at line 1 of main, and destroyed right before returning, past the last access to it in foo(). The coroutine frame is not destroyed automatically because control never flows off co_await final_suspend(), as per the standard. I've been staring at this code for a while, though, so please forgive me if I missed something obvious.
The assembly generated without sanitisation seems to make sense to me, and all the memory access happens within [rsp, rsp+24], as allocated. Furthermore, compiling with -fsanitize=address,undefined, or just -fsanitize=undefined, or simply compiling with gcc with -fsanitize=address reports no errors, which leads me to believe the issue is hidden somewhere in the code generated by asan.
Unfortunately, I can't quite make sense of what exactly happens in the code instrumented by asan, and that's why I'm posting this. I have a general understanding of Address Sanitizer's algorithm, but I can't map the assembly memory accesses/allocations to what's happening in the C++ code.
I'm hoping that an answer will help me:
Understand where the lifetime issues are hidden, if there are any.
Understand what exactly happens in main when compiled with asan, so that a person reading this has a clearer way of finding which memory access in the C++ code triggered the error, and where (if anywhere) that memory was allocated and freed.
Consistently suppress this particular false positive, and elaborate a bit on what causes it, if the issue really is in asan.
Thanks in advance.
1 This initially led me to believe that clang's optimizer reads result from the (destroyed) coroutine's frame directly, but moving destroy() into task's destructor brings the issue back and proves that theory wrong, as far as I can tell. destroy() is not in the destructor in the listing above because that would require implementing move construction/assignment to avoid a double free, and I wanted to keep the example as small and clear as possible.
I think I figured it out - but mostly because it's already fixed in clang 12.0.
Running the smaller/cleaner example with clang-12 shows no error from asan. The difference is in the following lines:
movabs rcx, -866669180174077455
mov qword ptr [r13 + 2147450880], rcx
mov dword ptr [r13 + 2147450888], -202116109
lea rdi, [r12 + 40]
mov rcx, rdi
shr rcx, 3
cmp byte ptr [rcx + 2147450880], 0
jne .LBB2_14
lea r14, [r12 + 32]
mov qword ptr [r14 + 8], offset f() [clone .cleanup]
lea rdi, [r14 + 16]
mov byte ptr [r13 + 2147450884], 0
mov rcx, rdi
shr rcx, 3
mov dl, byte ptr [rcx + 2147450880]
test dl, dl
jne .LBB2_7
.LBB2_8:
mov qword ptr [rbx + 16], rax # 8-byte Spill
mov dword ptr [r14 + 16], 0
Which clang-11 has and clang-12 doesn't. From the looks of it, the address sanitizer tries to check that r12+40 (which should be the promise's cleanup method) is initialized before initializing it.
Clang-12 simply performs no checks for the promise, leaving the entirety of the code above out.
TL;DR: (probably) a bug in clang-11 coroutine sanitation, fixed in 12.0, perhaps in later versions of clang-11 as well.
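For anyone stuck on clang-11 in the meantime, one hedged workaround sketch: exclude the function that touches the coroutine frame from asan instrumentation. This disables all address checks in that one function, so keep its body minimal; the run() wrapper here is my own illustrative name, not part of the original code:
// Assumption: clang's no_sanitize("address") function attribute (available since clang 3.8).
__attribute__((no_sanitize("address")))
int run() {
    auto t = func();
    t.Start();
    const auto result = t.Get();
    foo(t.handle_.promise().value);
    t.handle_.destroy();
    return result == 3 ? 0 : 1;
}

int main() { return run(); }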

Busy polling std::atomic - msvc optimizes loop away - why, and how to prevent?

I'm trying to implement a simple busy loop function.
This should keep polling a std::atomic variable for a maximum number of times (spinCount), and return true if the status did change (to anything other than NOT_AVAILABLE) within the given tries, or false otherwise:
// noinline is just to be able to inspect the resulting ASM a bit easier - in final code, this function SHOULD be inlined!
__declspec(noinline) static bool trySpinWait(std::atomic<Status>* statusPtr, const int spinCount)
{
    int iSpinCount = 0;
    while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
    return iSpinCount == spinCount;
}
However, it seems that MSVC just optimizes the loop away in Release mode for Win64. I'm pretty bad with assembly, but it doesn't look to me like it's ever even trying to read the value of statusPtr at all:
int iSpinCount = 0;
000000013F7E2040 xor eax,eax
while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
000000013F7E2042 inc eax
000000013F7E2044 cmp eax,edx
000000013F7E2046 jge trySpinWait+12h (013F7E2052h)
000000013F7E2048 mov r8d,dword ptr [rcx]
000000013F7E204B test r8d,r8d
000000013F7E204E je trySpinWait+2h (013F7E2042h)
return iSpinCount == spinCount;
000000013F7E2050 cmp eax,edx
000000013F7E2052 sete al
My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this, but it seems that's not the case (or rather, my understanding was probably wrong).
What am I doing wrong here, or rather - how can I best implement that loop without having it optimized away, with the least impact on overall performance?
I know I could use #pragma optimize( "", off ), but (other than in the example above) in my final code I'd very much like to have this call inlined into a larger function for performance reasons. It seems that this #pragma will generally prevent inlining, though.
Appreciate any thoughts!
Thanks
but doesn't look to me like it's ever even trying to read the value of statusPtr at all
It does reload it on every iteration of the loop:
000000013F7E2048 mov r8d,dword ptr [rcx] # rcx is statusPtr
My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this,
You do not need anything more than std::memory_order_relaxed here, because there is only one variable shared between threads (what's more, this code doesn't change the value of the atomic variable). There are no reordering concerns.
In other words, this function works as expected.
You may like to use the PAUSE instruction; see Benefitting Power and Performance Sleep Loops. A sketch follows below.
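A minimal sketch of the loop with a pause hint (my addition, not from the original answer; the Status enumerators other than NOT_AVAILABLE are assumptions, _mm_pause is the usual x86 intrinsic on MSVC/gcc/clang, and relaxed ordering suffices here as discussed above):
#include <atomic>
#include <immintrin.h>  // _mm_pause; on MSVC, <intrin.h> also provides it

enum class Status { NOT_AVAILABLE, AVAILABLE };  // assumption: actual enumerators unknown

static bool trySpinWait(std::atomic<Status>* statusPtr, const int spinCount)
{
    int iSpinCount = 0;
    while (++iSpinCount < spinCount &&
           statusPtr->load(std::memory_order_relaxed) == Status::NOT_AVAILABLE)
        _mm_pause();  // tells the CPU this is a spin-wait loop (saves power, helps SMT siblings)
    return iSpinCount == spinCount;  // same result convention as the original
}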

Double-check locking issues, c++

I left out the rest of the implementation for simplicity because it is not relevant here.
Consider the classical implementation of double-checked locking described in Modern C++ Design.
Singleton& Singleton::Instance()
{
    if (!pInstance_)
    {
        Guard myGuard(lock_);
        if (!pInstance_)
        {
            pInstance_ = new Singleton;
        }
    }
    return *pInstance_;
}
Here the author insists that we avoid the race condition. But I have read an article, which unfortunately I don't remember very well, in which the following flow was described:
Thread 1 enters the first if statement.
Thread 1 acquires the mutex and enters the second if body.
Thread 1 calls operator new and assigns the memory to pInstance_, then calls the constructor on that memory.
Suppose thread 1 has assigned the memory to pInstance_ but not yet constructed the object, and thread 2 enters the function.
Thread 2 sees that pInstance_ is not null (but not yet initialized by the constructor) and returns pInstance_.
In that article the author stated that the trick is that on the line pInstance_ = new Singleton; the memory can be allocated and assigned to pInstance_ before the constructor is called on that memory.
Relying on the standard or other reliable sources, can anyone please confirm or deny the possibility or correctness of this flow? Thanks!
The problem you describe can only occur if, for reasons I cannot imagine, the designers of the singleton used an explicit (and broken) two-step construction:
...
Guard myGuard(lock_);
if (!pInstance_)
{
    auto alloc = std::allocator<Singleton>();
    pInstance_ = alloc.allocate(1); // SHAME here: race condition
    // eventually other stuff
    alloc.construct(pInstance_);    // anything could have happened since allocation
}
...
Even if for some reason such a two-step construction were required, the pInstance_ member shall never contain anything other than nullptr or a pointer to a fully constructed instance:
auto alloc = std::allocator<Singleton>();
Singleton* tmp = alloc.allocate(1); // no problem here
// eventually other stuff
alloc.construct(tmp);               // nor here
pInstance_ = tmp;                   // a fully constructed instance
But beware: this fix is only guaranteed on a single-CPU system. Things could be much worse on multi-core systems, where the C++11 atomics semantics are indeed required.
The problem is that in the absence of guarantees otherwise, the store of the pointer into pInstance_ might be seen by some other thread before the construction of the object is complete. In that case, the other thread won't enter the mutex and will just immediately return pInstance_ and when the caller uses it, it can see uninitialized values.
This apparent reordering between the store(s) associated with the construction on Singleton and the store to pInstance_ may be caused by compiler or the hardware. I'll take a quick look at both cases below.
Compiler Reordering
Absent any specific guarantees related to concurrent reads (such as those offered by C++11's std::atomic objects), the compiler only needs to preserve the semantics of the code as seen by the current thread. That means, for example, that it may compile code "out of order" relative to how it appears in the source, as long as this doesn't have visible side effects (as defined by the standard) on the current thread.
In particular, it would not be uncommon for the compiler to reorder stores performed in the constructor for Singleton, with the store to pInstance_, as long as it can see that the effect is the same1.
Let's take a look at a fleshed out version of your example:
struct Lock {};
struct Guard {
    Guard(Lock& l);
};

int value;

struct Singleton {
    int x;
    Singleton() : x{value} {}

    static Lock lock_;
    static Singleton* pInstance_;
    static Singleton& Instance();
};

Singleton& Singleton::Instance()
{
    if (!pInstance_)
    {
        Guard myGuard(lock_);
        if (!pInstance_)
        {
            pInstance_ = new Singleton;
        }
    }
    return *pInstance_;
}
Here, the constructor for Singleton is very simple: it simply reads from the global value and assigns it to x, the only member of Singleton.
Using godbolt, we can check exactly how gcc and clang compile this. The gcc version, annotated, is shown below:
Singleton::Instance():
mov rax, QWORD PTR Singleton::pInstance_[rip]
test rax, rax
jz .L9 ; if pInstance != NULL, go to L9
ret
.L9:
sub rsp, 24
mov esi, OFFSET FLAT:_ZN9Singleton5lock_E
lea rdi, [rsp+15]
call Guard::Guard(Lock&) ; acquire the mutex
mov rax, QWORD PTR Singleton::pInstance_[rip]
test rax, rax
jz .L10 ; second check for null, if still null goto L10
.L1:
add rsp, 24
ret
.L10:
mov edi, 4
call operator new(unsigned long) ; allocate memory (pointer in rax)
mov edx, DWORD value[rip] ; load value global
mov QWORD pInstance_[rip], rax ; store pInstance pointer!!
mov DWORD [rax], edx ; store value into pInstance_->x
jmp .L1
The last few lines are critical, in particular the two stores:
mov QWORD pInstance_[rip], rax ; store pInstance pointer!!
mov DWORD [rax], edx ; store value into pInstance_->x
Effectively, the line pInstance_ = new Singleton; has been transformed into:
Singleton* stemp = static_cast<Singleton*>(operator new(sizeof(Singleton))); // (1) allocate uninitialized memory for a Singleton object on the heap
int vtemp = value;       // (2) read the global variable value
pInstance_ = stemp;      // (3) write the pointer, still uninitialized, into the global pInstance_ (oops!)
pInstance_->x = vtemp;   // (4) initialize the Singleton by writing x
Oops! Any second thread arriving when (3) has occurred, but (4) hasn't, will see a non-null pInstance_, but then read an uninitialized (garbage) value for pInstance->x.
So even without invoking any weird hardware reordering at all, this pattern isn't safe without doing more work.
Hardware Reordering
Let's say you organize so that the reordering of the stores above doesn't occur on your compiler2, perhaps by putting a compiler barrier such as asm volatile ("" ::: "memory"). With that small change, gcc now compiles this to have the two critical stores in the "desired" order:
mov DWORD PTR [rax], edx
mov QWORD PTR Singleton::pInstance_[rip], rax
So we're good, right?
Well on x86, we are. It happens that x86 has a relatively strong memory model, and all stores already have release semantics. I won't describe the full semantics, but in the context of two stores as above, it implies that stores appear in program order to other CPUs: so any CPU that sees the second write above (to pInstance_) will necessarily see the prior write (to pInstance_->x).
We can illustrate that by using the C++11 std::atomic feature to explicitly ask for a release store for pInstance_ (this also enables us to get rid of the compiler barrier):
static std::atomic<Singleton*> pInstance_;

...

if (!pInstance_)
{
    pInstance_.store(new Singleton, std::memory_order_release);
}
We get reasonable assembly with no hardware memory barriers or anything (there is a redundant load now, but this is both a missed-optimization by gcc and a consequence of the way we wrote the function).
So we're done, right?
Nope - most other platforms don't have the strong store-to-store ordering that x86 does.
Let's take a look at ARM64 assembly around the creation of the new object:
bl operator new(unsigned long)
mov x1, x0 ; x1 holds Singleton* temp
adrp x0, .LANCHOR0
ldr w0, [x0, #:lo12:.LANCHOR0] ; load value
str w0, [x1] ; temp->x = value
mov x0, x1
str x1, [x19, #pInstance_] ; pInstance_ = temp
So we have the str to pInstance_ as the last store, coming after the temp->x = value store, as we want. However, the ARM64 memory model doesn't guarantee that these stores appear in program order when observed by another CPU. So even though we've tamed the compiler, the hardware can still trip us up. You'll need a barrier to solve this.
Prior to C++11, there wasn't a portable solution for this problem. For a particular ISA you could use inline assembly to emit the right barrier. Your compiler might have a builtin like __sync_synchronize offered by gcc, or your OS might even have something.
In C++11 and beyond, however, we finally have a formal memory model built into the language, and what we need there, for double-checked locking, is a release store as the final store to pInstance_. We already saw this for x86, where we checked that no extra barrier was emitted. Using std::atomic with memory_order_release, the object-publishing code on ARM64 becomes:
bl operator new(unsigned long)
adrp x1, .LANCHOR0
ldr w1, [x1, #:lo12:.LANCHOR0]
str w1, [x0]
stlr x0, [x20]
The main difference is the final store is now stlr - a release store. You can check out the PowerPC side too, where an lwsync barrier has popped up between the two stores.
So the bottom line is that:
Double checked locking is safe in a sequentially consistent system.
Real-world systems almost always deviate from sequential consistency, either due to the hardware, the compiler or both.
To solve that, you need to tell the compiler what you want; it will then both avoid reordering itself and emit the necessary barrier instructions, if any, to prevent the hardware from causing a problem.
Prior to C++11, the "way you tell the compiler" to do that was platform/compiler/OS specific, but in C++11 you can simply use std::atomic with memory_order_acquire loads and memory_order_release stores.
The Load
The above only covered half of the problem: the store of pInstance_. The other half that can go wrong is the load, and the load is actually the most important for performance, since it represents the usual fast path taken after the singleton is initialized. What if pInstance_->x was loaded before pInstance_ itself was loaded and checked for null? In that case, you could still read an uninitialized value!
This might seem unlikely, since pInstance_ needs to be loaded before it is dereferenced, right? That is, there seems to be a fundamental dependency between the operations that prevents reordering, unlike the store case. Well, as it turns out, both hardware behavior and software transformation could still trip you up here, and the details are even more complex than the store case. If you use memory_order_acquire, though, you'll be fine. If you want the last ounce of performance, especially on PowerPC, you'll need to dig into the details of memory_order_consume. A tale for another day. A consolidated sketch of the full pattern follows after the footnotes.
1 In particular, this means that the compiler has to be able to see the code for the constructor Singleton() so that it can determine that it doesn't read from pInstance_.
2 Of course, it's very dangerous to rely on this since you'd have to check the assembly after every compilation if anything changed!
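Putting the two halves together, a minimal C++11 sketch of the corrected pattern (my consolidation of the above, not code from the book):
#include <atomic>
#include <mutex>

class Singleton {
public:
    static Singleton& Instance()
    {
        Singleton* p = pInstance_.load(std::memory_order_acquire); // fast path
        if (!p) {
            std::lock_guard<std::mutex> guard(lock_);
            p = pInstance_.load(std::memory_order_relaxed); // re-check under the lock
            if (!p) {
                p = new Singleton;
                pInstance_.store(p, std::memory_order_release); // publish only a fully built object
            }
        }
        return *p;
    }

private:
    Singleton() = default;
    static std::atomic<Singleton*> pInstance_;
    static std::mutex lock_;
};

std::atomic<Singleton*> Singleton::pInstance_{nullptr};
std::mutex Singleton::lock_;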
It used to be unspecified before C++11, because there was no standard memory model discussing multiple threads.
IIRC the pointer could have been set to the allocated address before the constructor completed so long as that thread would never be able to tell the difference (this could probably only happen for a trivial/non-throwing constructor).
Since C++11, the sequenced-before rules disallow that reordering, specifically
8) The side effect (modification of the left argument) of the built-in assignment operator ... is sequenced after the value computation ... of both left and right arguments, ...
Since the right argument is a new-expression, that must have completed allocation & construction before the left-hand-side can be modified.

Are function-local static mutexes thread-safe?

In the following program I attempt the make the print function thread-safe by using a function-local mutex object:
#include <iostream>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
void print(const std::string & s)
{
    // Thread safe?
    static std::mutex mtx;
    std::unique_lock<std::mutex> lock(mtx);
    std::cout << s << std::endl;
}

int main()
{
    std::thread([&](){ for (int i = 0; i < 10; ++i) print("a" + std::to_string(i)); }).detach();
    std::thread([&](){ for (int i = 0; i < 10; ++i) print("b" + std::to_string(i)); }).detach();
    std::thread([&](){ for (int i = 0; i < 10; ++i) print("c" + std::to_string(i)); }).detach();
    std::thread([&](){ for (int i = 0; i < 10; ++i) print("d" + std::to_string(i)); }).detach();
    std::thread([&](){ for (int i = 0; i < 10; ++i) print("e" + std::to_string(i)); }).detach();
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
Is this safe?
My doubts arise from this question, which presents a similar case.
C++11
In C++11 and later versions: yes, this pattern is safe. In particular, initialization of function-local static variables is thread-safe, so your code above works safely across threads.
The way this works in practice is that the compiler inserts any necessary boilerplate in the function itself to check whether the variable is initialized prior to access. In the case of std::mutex as implemented in gcc, clang and icc, however, the initialized state is all-zeros, so no explicit initialization is needed (the variable will live in the all-zeros .bss section, so the initialization is "free"), as we see from the assembly1:
inc(int& i):
mov eax, OFFSET FLAT:_ZL28__gthrw___pthread_key_createPjPFvPvE
test rax, rax
je .L2
push rbx
mov rbx, rdi
mov edi, OFFSET FLAT:_ZZ3incRiE3mtx
call _ZL26__gthrw_pthread_mutex_lockP15pthread_mutex_t
test eax, eax
jne .L10
add DWORD PTR [rbx], 1
mov edi, OFFSET FLAT:_ZZ3incRiE3mtx
pop rbx
jmp _ZL28__gthrw_pthread_mutex_unlockP15pthread_mutex_t
.L2:
add DWORD PTR [rdi], 1
ret
.L10:
mov edi, eax
call _ZSt20__throw_system_errori
Note that starting at the line mov edi, OFFSET FLAT:_ZZ3incRiE3mtx it simply loads the address of the inc::mtx function-local static and calls pthread_mutex_lock on it, without any initialization. The code before that dealing with pthread_key_create is apparently just checking if the pthreads library is present at all.
There's no guarantee, however, that all implementations will implement std::mutex as all-zeros, so you might in some cases incur ongoing overhead on each call to check whether the mutex has been initialized. Declaring the mutex outside the function would avoid that.
Here's an example contrasting the two approaches with a stand-in mutex2 class with a non-inlinable constructor (so the compiler can't determine that the initial state is all-zeros):
#include <mutex>

class mutex2 {
public:
    mutex2();
    void lock();
    void unlock();
};

void inc_local(int &i)
{
    // Thread safe?
    static mutex2 mtx;
    std::unique_lock<mutex2> lock(mtx);
    i++;
}

mutex2 g_mtx;

void inc_global(int &i)
{
    std::unique_lock<mutex2> lock(g_mtx);
    i++;
}
The function-local version compiles (on gcc) to:
inc_local(int& i):
push rbx
movzx eax, BYTE PTR _ZGVZ9inc_localRiE3mtx[rip]
mov rbx, rdi
test al, al
jne .L3
mov edi, OFFSET FLAT:_ZGVZ9inc_localRiE3mtx
call __cxa_guard_acquire
test eax, eax
jne .L12
.L3:
mov edi, OFFSET FLAT:_ZZ9inc_localRiE3mtx
call _ZN6mutex24lockEv
add DWORD PTR [rbx], 1
mov edi, OFFSET FLAT:_ZZ9inc_localRiE3mtx
pop rbx
jmp _ZN6mutex26unlockEv
.L12:
mov edi, OFFSET FLAT:_ZZ9inc_localRiE3mtx
call _ZN6mutex2C1Ev
mov edi, OFFSET FLAT:_ZGVZ9inc_localRiE3mtx
call __cxa_guard_release
jmp .L3
mov rbx, rax
mov edi, OFFSET FLAT:_ZGVZ9inc_localRiE3mtx
call __cxa_guard_abort
mov rdi, rbx
call _Unwind_Resume
Note the large amount of boilerplate dealing with the __cxa_guard_* functions. First, a rip-relative flag byte, _ZGVZ9inc_localRiE3mtx2, is checked, and if it is non-zero, the variable has already been initialized and we fall into the fast path. No atomic operations are needed because on x86 loads already have the needed acquire semantics.
If this check fails, we go to the slow path, which is essentially a form of double-checked locking: the initial check is not sufficient to determine that the variable needs initialization, because two or more threads may be racing here. The __cxa_guard_acquire call does the locking and the second check, and may either fall through to the fast path as well (if another thread concurrently initialized the object), or may jump down to the actual initialization code at .L12.
Finally, note that the last five instructions in the assembly aren't directly reachable from the function at all, as they are preceded by an unconditional jmp .L3 and nothing jumps to them. They are there to be jumped to by an exception handler should the call to the constructor mutex2() throw an exception at some point.
Overall, we can say that the runtime cost of the first-access initialization is low to moderate, because the fast path only checks a single byte flag without any expensive instructions (and the remainder of the function itself usually implies at least two atomic operations, for mutex.lock() and mutex.unlock()), but it comes at a significant code-size increase.
Compare this to the global version, which is identical except that initialization happens during global initialization rather than before first access:
inc_global(int& i):
push rbx
mov rbx, rdi
mov edi, OFFSET FLAT:g_mtx
call _ZN6mutex24lockEv
add DWORD PTR [rbx], 1
mov edi, OFFSET FLAT:g_mtx
pop rbx
jmp _ZN6mutex26unlockEv
The function is less than a third of the size without any initialization boilerplate at all.
Prior to C++11
Prior to C++11, however, this is generally not safe, unless your compiler makes some special guarantees about the way in which static locals are initialized.
Some time ago, while looking at a similar issue, I examined the assembly generated by Visual Studio for this case. The pseudocode for the generated assembly code for your print method looked something like this:
void print(const std::string & s)
{
    if (!init_check_print_mtx) {
        init_check_print_mtx = true;
        mtx.mutex(); // call mutex() ctor for mtx
    }
    // ... rest of method
}
Here init_check_print_mtx is a compiler-generated global variable specific to this method which tracks whether the local static has been initialized. Note that inside the "one time" initialization block guarded by this variable, the variable is set to true before the mutex is initialized.
I thought this was silly, since it ensures that other threads racing into this method will skip the initializer and use an uninitialized mtx - versus the alternative of possibly initializing mtx more than once - but in fact doing it this way allows you to avoid the infinite recursion issue that occurs if std::mutex() were to call back into print, and this behavior is in fact mandated by the standard.
Nemo above mentions that this has been fixed (more precisely, re-specified) in C++11 to require a wait for all racing threads, which makes this safe, but you'll need to check your own compiler for compliance. I didn't check whether the new spec in fact includes this guarantee, but I wouldn't be at all surprised given that local statics were pretty much useless in multithreaded environments without it (except perhaps for primitive values which don't have any check-and-set behavior because they just refer directly to an already-initialized location in the .data segment).
1 Note that I changed the print() function to a slightly simpler inc() function that just increments an integer in the locked region. This has the same locking structure and implications as the original, but avoids a bunch of code dealing with the << operators and std::cout.
2 Using c++filt this de-mangles to guard variable for inc_local(int&)::mtx.
This is not the same as the linked question for several reasons.
The linked question is not C++11, but yours is. In C++11 initialization of function-local static variables is always safe. Prior to C++11 it was only safe with some compilers, e.g. GCC and Clang, which default to thread-safe initialization.
The linked question initializes the reference by calling a function, which is dynamic initialization and happens at run-time. The default constructor for std::mutex is constexpr so your static variable has constant initialization, i.e. the mutex can be initialized at compile-time (or link-time) so there is nothing to do dynamically at runtime. Even if multiple threads call the function concurrently there's nothing they actually need to do before using the mutex.
Your code is safe (assuming your compiler implements the C++11 rules correctly).
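To make the distinction between constant and dynamic initialization concrete, a small sketch (my illustration; std::string stands in for any type without a constexpr constructor):
#include <mutex>
#include <string>

void f()
{
    static std::mutex m;           // constexpr ctor: constant-initialized at compile/link
                                   // time, so no runtime guard is needed
    static std::string s = "tag";  // dynamic initialization: the compiler emits a
                                   // thread-safe one-time guard (e.g. __cxa_guard_acquire
                                   // on gcc/clang, as shown in the listing above)
    std::lock_guard<std::mutex> lock(m);
    // ... use s under the lock ...
}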
As long as the mutex is static, yes.
Local, non-static would definitely NOT be safe. Unless all your threads use the same stack, which also means you've now invented memory where one cell can hold many different values at the same time, and are just waiting for the Nobel committee to notify you about the next Nobel prize.
You must have some sort of "global" (shared) memory space for mutexes.

Should std::atomic<int*>::load be doing a compare-and-swap loop?

Summary: I had expected that std::atomic<int*>::load with std::memory_order_relaxed would be close to the performance of just loading a pointer directly, at least when the loaded value rarely changes. I saw far worse performance for the atomic load than a normal load on Visual Studio C++ 2012, so I decided to investigate. It turns out that the atomic load is implemented as a compare-and-swap loop, which I suspect is not the fastest possible implementation.
Question: Is there some reason that std::atomic<int*>::load needs to do a compare-and-swap loop?
Background: I believe that MSVC++ 2012 is doing a compare-and-swap loop on atomic load of a pointer based on this test program:
#include <atomic>
#include <iostream>
template<class T>
__declspec(noinline) T loadRelaxed(const std::atomic<T>& t) {
    return t.load(std::memory_order_relaxed);
}

int main() {
    int i = 42;
    char c = 42;
    std::atomic<int*> ptr(&i);
    std::atomic<int> integer;
    std::atomic<char> character;
    std::cout
        << *loadRelaxed(ptr) << ' '
        << loadRelaxed(integer) << ' '
        << loadRelaxed(character) << std::endl;
    return 0;
}
I'm using a __declspec(noinline) function in order to isolate the assembly instructions related to the atomic load. I made a new MSVC++ 2012 project, added an x64 platform, selected the release configuration, ran the program in the debugger and looked at the disassembly. It turns out that both the std::atomic<char> and std::atomic<int> parameters end up giving the same call to loadRelaxed<int> - this must be something the optimizer did. Here is the disassembly of the two loadRelaxed instantiations that get called:
loadRelaxed<int * __ptr64>
000000013F4B1790 prefetchw [rcx]
000000013F4B1793 mov rax,qword ptr [rcx]
000000013F4B1796 mov rdx,rax
000000013F4B1799 lock cmpxchg qword ptr [rcx],rdx
000000013F4B179E jne loadRelaxed<int * __ptr64>+6h (013F4B1796h)
loadRelaxed<int>
000000013F3F1940 prefetchw [rcx]
000000013F3F1943 mov eax,dword ptr [rcx]
000000013F3F1945 mov edx,eax
000000013F3F1947 lock cmpxchg dword ptr [rcx],edx
000000013F3F194B jne loadRelaxed<int>+5h (013F3F1945h)
The instruction lock cmpxchg is atomic compare-and-swap and we see here that the code for atomically loading a char, an int or an int* is a compare-and-swap loop. I also built this code for 32-bit x86 and that implementation is still based on lock cmpxchg.
I do not believe that relaxed atomic loads require compare-and-swap. In the end this std::atomic implementation was not usable for my purpose, but I still wanted to have the interface, so I made my own std::atomic using MSVC's barrier intrinsics. This has better performance than the default std::atomic for my use case. You can see the code here. It's supposed to implement the C++11 spec for all the orderings for load and store. By the way, GCC 4.6 is no better in this regard. I don't know about GCC 4.7.
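For reference, a sketch of what an atomic pointer load can compile to when the implementation doesn't fall back to compare-and-swap (my illustration; on x86-64, naturally aligned loads up to 8 bytes are atomic, so even a seq_cst load needs no lock prefix):
#include <atomic>

int* loadRelaxed(const std::atomic<int*>& t) {
    // Typical x86-64 codegen for this on gcc/clang (and modern MSVC):
    //     mov rax, qword ptr [rdi]   ; a plain load - already atomic for aligned 8-byte values
    //     ret
    return t.load(std::memory_order_relaxed);
}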