Does a thread_local singleton perform lazy initialization by default? - C++

I have the following code of a thread_local singleton:
#include <iostream>

struct Singl {
    int& ref;
    Singl(int& r) : ref(r) {}
    ~Singl() {}
    void print() { std::cout << &ref << std::endl; }
};

static auto& singl(int& r) {
    static thread_local Singl i(r);
    return i;
}

int main() {
    int x = 4;
    singl(x).print();

    int y = 55;
    singl(y).print();

    return 0;
}
This program prints the address of x twice.
The compiler (gcc 8.1 on godbolt) seems to perform lazy initialization of the singleton object:
singl(int&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR fs:0
add rax, OFFSET FLAT:guard variable for singl(int&)::i@tpoff
movzx eax, BYTE PTR [rax]
test al, al
jne .L5
mov rax, QWORD PTR [rbp-8]
mov rdx, QWORD PTR fs:0
add rdx, OFFSET FLAT:singl(int&)::i@tpoff
mov rsi, rax
mov rdi, rdx
call Singl::Singl(int&)
mov rax, QWORD PTR fs:0
add rax, OFFSET FLAT:guard variable for singl(int&)::i@tpoff
mov BYTE PTR [rax], 1
mov rax, QWORD PTR fs:0
add rax, OFFSET FLAT:singl(int&)::i@tpoff
mov edx, OFFSET FLAT:__dso_handle
mov rsi, rax
mov edi, OFFSET FLAT:_ZN5SinglD1Ev
call __cxa_thread_atexit
.L5:
mov rax, QWORD PTR fs:0
add rax, OFFSET FLAT:singl(int&)::i@tpoff
leave
ret
Is this the default behaviour I can expect whenever I make multiple calls to the singl-function passing different arguments? Or is it possible that the singleton object might be initialized a second time on a subsequent call?

This is indeed guaranteed. static/thread_local local variables are initialized exactly once, the first time control passes through their declaration.
A few points to take note of:
If multiple threads call the function concurrently, only one will perform the initialization and the others will wait; that is what the guard variable in the disassembly is for. (For a thread_local variable the guard itself lives in thread-local storage, as the fs:-relative addressing shows, so each thread initializes its own copy.)
If the initialization throws an exception, it is considered incomplete, and will be performed again the next time control reaches it.
In other words, they _just work_™.
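A minimal sketch of my own (not from the question) illustrating both guarantees: the initializer runs only on the first call, and an initialization that throws is retried on a later call:

#include <cassert>
#include <stdexcept>

int& counter(int& first) {
    static thread_local int& ref = first;  // bound on the first call in each thread
    return ref;
}

struct MayThrow {
    MayThrow(bool fail) { if (fail) throw std::runtime_error("init failed"); }
};

MayThrow& lazy(bool fail) {
    static MayThrow m(fail);  // if a previous attempt threw, initialization is retried
    return m;
}

int main() {
    int x = 4, y = 55;
    assert(&counter(x) == &x);  // first call: ref is bound to x
    assert(&counter(y) == &x);  // second call: y is ignored, ref is already initialized

    try { lazy(true); } catch (const std::exception&) {}  // construction throws, init not done
    lazy(false);  // initialization is attempted again and now succeeds
}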

Related

C++ compiler: inline usage of a non-inline function defined in the same module

T.hpp

class T
{
    int _i;
public:
    int get() const;
    int some_fun();
};

T.cpp

#include "T.hpp"

int T::get() const
{ return _i; }

int T::some_fun()
{
    // noise
    int i = get(); // (1)
    // noise
    return i;
}
get() is a non-inline function; however, it is defined in the same translation unit as some_fun. Since the compiler can see the definition of get while compiling some_fun, do compilers, at least in optimized builds, replace the call to get() at line (1) with just _i?
If I'm not mistaken, the compiler, with the exception of templates, only does one-pass parsing. What if get is defined after some_fun?
OK, I answered this myself. I thought I couldn't read assembly, but it wasn't that hard to try.
Code:
class T
{
    int _i = 5;
public:
    int get() const;
    int some_fun();
};

int T::get() const { return _i; }

int T::some_fun()
{
    int i = get();
    return i;
}

int main()
{
    T o;
    return o.some_fun();
}
Non-optimized assembly output (via godbolt.org). There is a lot of boilerplate, but you can see the explicit calls:
T::get() const:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
pop rbp
ret
T::some_fun():
push rbp
mov rbp, rsp
sub rsp, 24
mov QWORD PTR [rbp-24], rdi
mov rax, QWORD PTR [rbp-24]
mov rdi, rax
call T::get() const // !!!!
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
leave
ret
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 5
lea rax, [rbp-4]
mov rdi, rax
call T::some_fun() // !!!!
nop
leave
ret
Optimized output (-O3):
T::get() const:
mov eax, DWORD PTR [rdi]
ret
T::some_fun():
mov eax, DWORD PTR [rdi]
ret
main:
mov eax, 5
ret
Here, some_fun has inlined the call to get (the call instruction is gone and its body is now identical to get's), but the standalone get function is still emitted.
main went even further: it inlined the call to some_fun, saw that at that point o still holds its default value of 5, and therefore returns 5 directly without even creating o.
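To address the follow-up about definition order, here is a sketch of my own (output not shown): get is declared first but defined after some_fun. In practice gcc and clang optimize after the whole translation unit has been parsed, so the later definition is still available for inlining in optimized builds:

class T
{
    int _i = 5;
public:
    int get() const;
    int some_fun();
};

int T::some_fun()
{
    return get();  // only the declaration of get is visible at this point
}

int T::get() const { return _i; }  // defined later in the same translation unit

int main()
{
    T o;
    return o.some_fun();
}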

Will deleting a pointer to pointer cause a memory leak?

I have a question about C-style string symbolic constants and dynamically allocating arrays.
const char** name = new const char* { "Alan" };
delete name;
When I try to delete name after new'ing that piece of memory, the compiler suggests using delete rather than delete[]. I understand that name only stores the address of the pointer to the read-only string.
However, if I only delete the pointer to pointer (which is name), will the string itself be leaked?
As the comments above indicate, you don't need to manage the memory that "Alan" exists in.
Let's see what that looks like in practice.
I made a modified version of your code:
#include <iostream>

void test() {
    const char** name;
    name = new const char* { "Alan\n" };
    delete name;
}

int main()
{
    test();
}
and then I popped it into godbolt to see what's happening under the hood (excerpts copied below).
In both clang and gcc, the memory that stores "Alan\n" is static storage, so it exists for the whole lifetime of the program. That is why there is no memory leak even though you never touch the string again after mentioning it. The value of the pointer to "Alan\n" is just a position in the program's image: offset .L.str (clang) or OFFSET FLAT:.LC0 (gcc).
clang:
test(): # #test()
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 8
call operator new(unsigned long)
mov rcx, rax
movabs rdx, offset .L.str
mov qword ptr [rax], rdx
mov qword ptr [rbp - 8], rcx
mov rax, qword ptr [rbp - 8]
cmp rax, 0
mov qword ptr [rbp - 16], rax # 8-byte Spill
je .LBB1_2
mov rax, qword ptr [rbp - 16] # 8-byte Reload
mov rdi, rax
call operator delete(void*)
.L.str:
.asciz "Alan\n"
gcc:
.LC0:
.string "Alan\n"
test():
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 8
call operator new(unsigned long)
mov QWORD PTR [rax], OFFSET FLAT:.LC0
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
test rax, rax
je .L3
mov esi, 8
mov rdi, rax
call operator delete(void*, unsigned long)
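For contrast, a sketch of my own showing when the character data itself would need releasing: only memory you allocate yourself must be freed; the literal never does.

#include <cstring>

void literal_case() {
    const char** name = new const char* { "Alan" };  // only the pointer itself is heap-allocated
    delete name;  // frees the pointer; "Alan" lives in static storage and needs no cleanup
}

void heap_copy_case() {
    char* copy = new char[5];
    std::strcpy(copy, "Alan");  // this buffer is dynamically allocated
    delete[] copy;              // so it must be released, with delete[]
}

int main() {
    literal_case();
    heap_copy_case();
}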

Is this an old C++ style constructor?

Here is a piece of C++ code.
In this example, many code blocks look like constructor calls.
Unfortunately, code block #3 is not (you can check it using https://godbolt.org/z/q3rsxn and https://cppinsights.io).
I think it is an old C++ notation, and it could explain the introduction of the C++11 brace construction notation using {} (cf. #4).
Do you have an explanation for the meaning of T(i), which looks so close to a constructor call but behaves so differently?
struct T {
    T() { }
    T(int i) { }
};

int main() {
    int i = 42;
    { // #1
        T t(i); // new T named t using int ctor
    }
    { // #2
        T t = T(i); // new T named t using int ctor
    }
    { // #3
        T(i); // new T named i using default ctor
    }
    { // #4
        T{i}; // new T using int ctor (unnamed result)
    }
    { // #5
        T(2); // new T using int ctor (unnamed result)
    }
}
NB: thus, T(i); in #3 is equivalent to the declaration T i;
The statement:
T(i);
is equivalent to:
T i;
In other words, it declares a variable named i with type T. This is because parentheses are allowed in declarations in some places (in order to change the binding of declarators) and since this statement can be parsed as a declaration, it is a declaration (even though it might make more sense as an expression).
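A small sketch of my own (not from the answer) showing ways to force the expression reading, i.e. to actually construct an unnamed temporary from i:

struct T {
    T() { }
    T(int i) { }
};

int main() {
    int i = 42;
    (T(i));             // extra parentheses: cannot be parsed as a declaration,
                        // so an unnamed temporary T is constructed from i
    T{i};               // braces (C++11): never parsed as a declaration
    static_cast<T>(i);  // an explicit conversion is unambiguously an expression
}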
You can use Compiler Explorer to see what happens in assembler.
You can see that #1, #2, #4 and #5 do the same thing, but strangely #3 calls the other constructor, T::T() (the default constructor).
Does anyone have an explanation?
Assembler code :
T::T() [base object constructor]:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
nop
pop rbp
ret
T::T(int):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov DWORD PTR [rbp-12], esi
nop
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 42
// #1
mov edx, DWORD PTR [rbp-4]
lea rax, [rbp-7]
mov esi, edx
mov rdi, rax
call T::T(int)
// #2
mov edx, DWORD PTR [rbp-4]
lea rax, [rbp-8]
mov esi, edx
mov rdi, rax
call T::T(int)
// #3
lea rax, [rbp-9]
mov rdi, rax
call T::T() [complete object constructor]
// #4
mov edx, DWORD PTR [rbp-4]
lea rax, [rbp-6]
mov esi, edx
mov rdi, rax
call T::T(int)
// #5
lea rax, [rbp-5]
mov esi, 2
mov rdi, rax
call T::T(int)
mov eax, 0
leave
ret

Should I create an object on the free store only for one call?

Assuming that the code is located inside an if block, what are the differences between creating an object on the free store and making only one call on it:
auto a = aFactory.createA();
int result = a->foo(5);
and making the call directly on the returned pointer?
int result = aFactory.createA()->foo(5);
Is there any difference in performance? Which way is better?
#include <iostream>
#include <memory>

class A
{
public:
    int foo(int a) { return a + 3; }
};

class AFactory
{
public:
    std::unique_ptr<A> createA() { return std::make_unique<A>(); }
};

int main()
{
    AFactory aFactory;
    bool condition = true;
    if (condition)
    {
        auto a = aFactory.createA();
        int result = a->foo(5);
    }
}
Look at the generated code below: there is no difference between the two versions, even with optimisations disabled.
Using gcc 7.1 with -std=c++1z -O0,
auto a = aFactory.createA();
int result = a->foo(5);
is compiled to:
lea rax, [rbp-24]
lea rdx, [rbp-9]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
and int result = aFactory.createA()->foo(5); to:
lea rax, [rbp-16]
lea rdx, [rbp-17]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
So they are pretty much identical.
This outcome is understandable when you realise that the only difference between the two versions is that in the first one we give the object a name, while in the second we work with an unnamed one. Other than that, both objects are created on the heap and used in the same way. And since a variable's name means nothing to the compiler - it is only relevant to readers of the code - it treats the two as if they were identical.
In your simple case it will not make a difference, because the (main) function ends right after creating and using a.
If more lines of code followed, the a object would be destroyed at the end of the if block in main, while in the one-line case it is destroyed at the end of that single statement. However, it would be bad design if the destructor of a more sophisticated class A depended on that difference.
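A sketch of my own (a class with a destructor side effect, assumed for illustration rather than taken from the question) that makes the difference in destruction time visible:

#include <iostream>
#include <memory>

struct A {
    int foo(int a) { return a + 3; }
    ~A() { std::cout << "~A\n"; }
};

int main() {
    {
        auto a = std::make_unique<A>();
        int r = a->foo(5);
        std::cout << "named: " << r << '\n';  // printed before "~A"
    }                                         // a is destroyed here, at the end of the block
    {
        int r = std::make_unique<A>()->foo(5);  // the temporary unique_ptr is destroyed at the
                                                // end of this full expression, so "~A" is
                                                // printed before the next line's output
        std::cout << "temporary: " << r << '\n';
    }
}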
Because of compiler optimizations, performance questions should always be answered by profiling the concrete code.

Why do gcc and clang produce very different code for member function pointer template parameters?

I am trying to understand what is going on when a member function pointer is used as a template parameter. I always thought that function pointers (and member function pointers) are a run-time concept, so I was wondering what happens when they are used as template parameters. For this reason I took a look at the output produced by this code:
struct Foo { void foo(int i) { } };

template <typename T, void (T::*F)(int)>
void callFunc(T& t) { (t.*F)(1); }

void callF(Foo& f) { f.foo(1); }

int main() {
    Foo f;
    callF(f);
    callFunc<Foo, &Foo::foo>(f);
}
where callF is for comparison. gcc 6.2 produces the exact same output for both functions:
callF(Foo&): // void callFunc<Foo, &Foo::foo>(Foo&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov esi, 1
mov rdi, rax
call Foo::foo(int)
nop
leave
ret
while clang 3.9 produces almost the same output for callF():
callF(Foo&): # #callF(Foo&)
push rbp
mov rbp, rsp
sub rsp, 16
mov esi, 1
mov qword ptr [rbp - 8], rdi
mov rdi, qword ptr [rbp - 8]
call Foo::foo(int)
add rsp, 16
pop rbp
ret
but very different output for the template instantiation:
void callFunc<Foo, &Foo::foo>(Foo&): # #void callFunc<Foo, &Foo::foo>(Foo&)
push rbp
mov rbp, rsp
sub rsp, 32
xor eax, eax
mov cl, al
mov qword ptr [rbp - 8], rdi
mov rdi, qword ptr [rbp - 8]
test cl, 1
mov qword ptr [rbp - 16], rdi # 8-byte Spill
jne .LBB3_1
jmp .LBB3_2
.LBB3_1:
movabs rax, Foo::foo(int)
sub rax, 1
mov rcx, qword ptr [rbp - 16] # 8-byte Reload
mov rdx, qword ptr [rcx]
mov rax, qword ptr [rdx + rax]
mov qword ptr [rbp - 24], rax # 8-byte Spill
jmp .LBB3_3
.LBB3_2:
movabs rax, Foo::foo(int)
mov qword ptr [rbp - 24], rax # 8-byte Spill
jmp .LBB3_3
.LBB3_3:
mov rax, qword ptr [rbp - 24] # 8-byte Reload
mov esi, 1
mov rdi, qword ptr [rbp - 16] # 8-byte Reload
call rax
add rsp, 32
pop rbp
ret
Why is that? Is gcc taking some (possibly non-standard) shortcut?
gcc was able to figure out what the template was doing and generated the simplest code possible; clang didn't. A compiler is permitted to perform any optimization as long as the observable results comply with the C++ specification. If that means optimizing away an intermediate member function pointer, so be it: nothing else in the code references it, so it can be removed completely and the whole thing replaced with a simple direct call.
gcc and clang are different compilers, written by different people, with different approaches and algorithms for compiling C++.
It is natural, and expected, to see different results from different compilers. In this case, gcc was able to figure things out better than clang. I'm sure there are other situations where clang will figure things out better than gcc.
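A sketch of my own (the constexpr variable is purely illustrative) underlining why this is optimizable at all: a member function pointer used as a non-type template parameter is a compile-time constant, so the call target is known at instantiation time:

struct Foo { void foo(int) { } };

template <typename T, void (T::*F)(int)>
void callFunc(T& t) { (t.*F)(1); }  // F is fixed for each instantiation

// The same fact stated explicitly: a pointer to member function can be a constant expression.
constexpr void (Foo::*fp)(int) = &Foo::foo;

int main() {
    Foo f;
    (f.*fp)(2);                   // calling through the constexpr pointer
    callFunc<Foo, &Foo::foo>(f);  // call target is known at compile time, so with
                                  // optimizations enabled it can become a direct call
}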
This test was done without any optimizations requested.
One compiler generated more verbose unoptimized code.
Unoptimized code is, quite simply, uninteresting. It is intended to be correct and easy to debug, and it derives directly from an intermediate representation that is designed to be easy to optimize.
The details of optimized code are what matter; unoptimized output is only a problem if it is so ridiculously slow that debugging becomes painful.
There is nothing of interest to see or explain here.