Is this a bug in gcc optimizer? - c++

When I compile the following code with gcc 6 at -O3 -std=c++14, I get a nice, empty main:
Dump of assembler code for function main():
0x00000000004003e0 <+0>: xor %eax,%eax
0x00000000004003e2 <+2>: retq
But uncommenting the last line in main "breaks" the optimization:
Dump of assembler code for function main():
0x00000000004005f0 <+0>: sub $0x78,%rsp
0x00000000004005f4 <+4>: lea 0x40(%rsp),%rdi
0x00000000004005f9 <+9>: movq $0x400838,0x10(%rsp)
0x0000000000400602 <+18>: movb $0x0,0x18(%rsp)
0x0000000000400607 <+23>: mov %fs:0x28,%rax
0x0000000000400610 <+32>: mov %rax,0x68(%rsp)
0x0000000000400615 <+37>: xor %eax,%eax
0x0000000000400617 <+39>: movl $0x0,(%rsp)
0x000000000040061e <+46>: movq $0x400838,0x30(%rsp)
0x0000000000400627 <+55>: movb $0x0,0x38(%rsp)
0x000000000040062c <+60>: movl $0x0,0x20(%rsp)
0x0000000000400634 <+68>: movq $0x400838,0x50(%rsp)
0x000000000040063d <+77>: movb $0x0,0x58(%rsp)
0x0000000000400642 <+82>: movl $0x0,0x40(%rsp)
0x000000000040064a <+90>: callq 0x400790 <ErasedObject::~ErasedObject()>
0x000000000040064f <+95>: lea 0x20(%rsp),%rdi
0x0000000000400654 <+100>: callq 0x400790 <ErasedObject::~ErasedObject()>
0x0000000000400659 <+105>: mov %rsp,%rdi
0x000000000040065c <+108>: callq 0x400790 <ErasedObject::~ErasedObject()>
0x0000000000400661 <+113>: mov 0x68(%rsp),%rdx
0x0000000000400666 <+118>: xor %fs:0x28,%rdx
0x000000000040066f <+127>: jne 0x400678 <main()+136>
0x0000000000400671 <+129>: xor %eax,%eax
0x0000000000400673 <+131>: add $0x78,%rsp
0x0000000000400677 <+135>: retq
0x0000000000400678 <+136>: callq 0x4005c0 <__stack_chk_fail@plt>
Code
#include <type_traits>
#include <new>
#include <utility> // for std::forward

namespace
{

struct ErasedTypeVTable
{
    using destructor_t = void (*)(void *obj);

    destructor_t dtor;
};

template <typename T>
void dtor(void *obj)
{
    return static_cast<T *>(obj)->~T();
}

template <typename T>
static const ErasedTypeVTable erasedTypeVTable = {
    &dtor<T>
};

}

struct ErasedObject
{
    std::aligned_storage<sizeof(void *)>::type storage;
    const ErasedTypeVTable& vtbl;
    bool flag = false;

    template <typename T, typename S = typename std::decay<T>::type>
    ErasedObject(T&& obj)
        : vtbl(erasedTypeVTable<S>)
    {
        static_assert(sizeof(T) <= sizeof(storage) && alignof(T) <= alignof(decltype(storage)), "");
        new (object()) S(std::forward<T>(obj));
    }

    ErasedObject(ErasedObject&& other) = default;

    ~ErasedObject()
    {
        if (flag)
        {
            ::operator delete(object());
        }
        else
        {
            vtbl.dtor(object());
        }
    }

    void *object()
    {
        return reinterpret_cast<char *>(&storage);
    }
};

struct myType
{
    int a;
};

int main()
{
    ErasedObject c1(myType{});
    ErasedObject c2(myType{});
    //ErasedObject c3(myType{});
}
clang can optimize out both versions.
Any ideas what's going on? Am I hitting some optimization limit? If so, is it configurable?

I ran g++ with -fdump-ipa-inline to get more information about why functions are or are not inlined.
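For reference, a sketch of the invocation (the file name is assumed); the decisions land in a dump file written next to the object file, with the exact pass number in the name varying between gcc versions:
g++ -O3 -std=c++14 -fdump-ipa-inline test.cpp
cat test.cpp.*i.inline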
For the test case with the main() function and three objects created, I got:
(...)
150 Deciding on inlining of small functions. Starting with size 35.
151 Enqueueing calls in void {anonymous}::dtor(void*) [with T = myType]/40.
152 Enqueueing calls in int main()/35.
153 not inlinable: int main()/35 -> ErasedObject::~ErasedObject()/33, call is unlikely and code size would grow
154 not inlinable: int main()/35 -> ErasedObject::~ErasedObject()/33, call is unlikely and code size would grow
155 not inlinable: int main()/35 -> ErasedObject::~ErasedObject()/33, call is unlikely and code size would grow
(...)
This error code is set in gcc/gcc/ipa-inline.c:
else if (!e->maybe_hot_p ()
         && (growth >= MAX_INLINE_INSNS_SINGLE
             || growth_likely_positive (callee, growth)))
  {
    e->inline_failed = CIF_UNLIKELY_CALL;
    want_inline = false;
  }
Then I discovered that the smallest change to make g++ inline these functions is to add a declaration:
int main() __attribute__((hot));
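In context, this is the whole change (a sketch; the attribute only has to be visible before the definition):
int main() __attribute__((hot));

int main()
{
    ErasedObject c1(myType{});
    ErasedObject c2(myType{});
    ErasedObject c3(myType{});
}
With that declaration in place, g++ inlines the destructor calls again.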
I wasn't able to find in the code why int main() isn't considered hot, but that should probably be left for another question.
More interesting is the second part of the conditional I pasted above. The intent was to not inline when the code would grow, and you have produced an example where the code actually shrinks after complete inlining.
I think this deserves to be reported on GCC's bugzilla, but I'm not sure you can call it a bug - the estimation of inline impact is a heuristic and, as such, it is expected to work correctly in most cases, not all of them.

Related

How to let MSVC compiler optimize multiple step POD initialization?

I've made this sample code:
#include <vector>

struct POD {
    int a;
    int b;
    int c;

    inline static POD make_pod_with_default()
    {
        POD p{ 41, 51, 61 };
        return p;
    }
    inline void change_pod_a(POD &p, int a) {
        p.a = a;
    }
    inline void change_pod_b(POD &p, int b) {
        p.b = b;
    }
    static POD make_pod_with_a(int a) {
        POD p = make_pod_with_default();
        p.change_pod_a(p, a);
        return p;
    }
    static POD make_pod_with_b(int a) {
        POD p = make_pod_with_default();
        p.change_pod_b(p, a);
        return p;
    }
};

int main()
{
    std::vector<POD> vec{};
    vec.reserve(2);
    vec.push_back(POD::make_pod_with_a(71));
    vec.push_back(POD::make_pod_with_b(81));
    return vec[0].a + vec[0].b + vec[0].c + vec[1].a + vec[1].b + vec[1].c;
}
In the compiled assembly code we can see the following instructions are being generated for the first vec.push_back(...) call:
...
mov DWORD PTR $T2[esp+32], 41 ; 00000029H
...
mov DWORD PTR $T2[esp+36], 51 ; 00000033H
...
mov DWORD PTR $T5[esp+32], 71 ; 00000047H
...
mov DWORD PTR $T6[esp+44], 61 ; 0000003dH
...
There's a mov to [esp+32] for the 71, but the mov to [esp+32] for the 41 is still there, being useless! How can I write code for MSVC that will enable this kind of optimization? Is MSVC even capable of it?
Both GCC and CLANG produce more optimized versions, but CLANG wins by a large margin, with literally no overhead, in a very clean and logical fashion:
CLANG generated code:
main: # @main
push rax
mov edi, 24
call operator new(unsigned long)
mov rdi, rax
call operator delete(void*)
mov eax, 366
pop rcx
ret
Everything is done at compile time as 71 + 51 + 61 + 41 + 81 + 61 = 366!
I must admit it's painful to see my program being computed at compile time and still throwing in that call to vec.reserve() in the assembly... but CLANG still takes the cake, by far! Come on MSVC, this is not a vector of volatile.
If you make your methods constexpr, you can do:
constexpr POD step_one()
{
    POD p{2, 5, 11};
    p.b = 3;
    return p;
}

constexpr void step_two(POD &p)
{
    p.c = 5;
}

constexpr POD make_pod()
{
    POD p = step_one();
    step_two(p);
    return p;
}

POD make_pod_final()
{
    constexpr POD res = make_pod();
    return res;
}
resulting in:
make_pod_final PROC
mov eax, DWORD PTR $T1[esp-4]
mov DWORD PTR [eax], 2
mov DWORD PTR [eax+4], 3
mov DWORD PTR [eax+8], 5
ret 0
Demo

Cost of thread-safe local static variable initialization in C++11?

We know that local static variable initialization is thread-safe in C++11, and modern compilers fully support this. (Is local static variable initialization thread-safe in C++11?)
What is the cost of making it thread-safe? I understand that this could very well be compiler implementation dependent.
Context: I have a multi-threaded application (10 threads) accessing a singleton object pool instance via the following function at very high rates, and I'm concerned about its performance implications.
template <class T>
ObjectPool<T>* ObjectPool<T>::GetInst()
{
    static ObjectPool<T> instance;
    return &instance;
}
A look at the generated assembler code helps.
Source
#include <vector>

std::vector<int> &get(){
    static std::vector<int> v;
    return v;
}

int main(){
    return get().size();
}
Assembler
std::vector<int, std::allocator<int> >::~vector():
movq (%rdi), %rdi
testq %rdi, %rdi
je .L1
jmp operator delete(void*)
.L1:
rep ret
get():
movzbl guard variable for get()::v(%rip), %eax
testb %al, %al
je .L15
movl get()::v, %eax
ret
.L15:
subq $8, %rsp
movl guard variable for get()::v, %edi
call __cxa_guard_acquire
testl %eax, %eax
je .L6
movl guard variable for get()::v, %edi
movq $0, get()::v(%rip)
movq $0, get()::v+8(%rip)
movq $0, get()::v+16(%rip)
call __cxa_guard_release
movl $__dso_handle, %edx
movl get()::v, %esi
movl std::vector<int, std::allocator<int> >::~vector(), %edi
call __cxa_atexit
.L6:
movl get()::v, %eax
addq $8, %rsp
ret
main:
subq $8, %rsp
call get()
movq 8(%rax), %rdx
subq (%rax), %rdx
addq $8, %rsp
movq %rdx, %rax
sarq $2, %rax
ret
Compared to
Source
#include <vector>

static std::vector<int> v;

std::vector<int> &get(){
    return v;
}

int main(){
    return get().size();
}
Assembler
std::vector<int, std::allocator<int> >::~vector():
movq (%rdi), %rdi
testq %rdi, %rdi
je .L1
jmp operator delete(void*)
.L1:
rep ret
get():
movl v, %eax
ret
main:
movq v+8(%rip), %rax
subq v(%rip), %rax
sarq $2, %rax
ret
movl $__dso_handle, %edx
movl v, %esi
movl std::vector<int, std::allocator<int> >::~vector(), %edi
movq $0, v(%rip)
movq $0, v+8(%rip)
movq $0, v+16(%rip)
jmp __cxa_atexit
I'm not that great with assembler, but I can see that in the first version v has a lock around it and get is not inlined, whereas in the second version get is essentially gone.
You can play around with various compilers and optimization flags, but it seems no compiler is able to inline or optimize out the locks, even though the program is obviously single-threaded.
You can add static to get, which makes gcc inline get while preserving the lock.
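For reference, that variant is just:
static std::vector<int> &get(){
    static std::vector<int> v;
    return v;
}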
To know how much these locks and additional instructions cost for your compiler, flags, platform and surrounding code you would need to make a proper benchmark.
I would expect the locks to have some overhead and be significantly slower than the inlined code, which becomes insignificant when you actually do work with the vector, but you can never be sure without measuring.
From my experience, this is exactly as costly as a regular mutex (critical section). If the code is called very frequently, consider using a normal global variable instead.
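A minimal sketch of that suggestion applied to the pool from the question (assuming ObjectPool<T> is default-constructible and that cross-translation-unit initialization order is not a concern; g_pool_instance is a name invented here):
template <class T>
ObjectPool<T> g_pool_instance;   // constructed eagerly at startup, no per-call guard

template <class T>
ObjectPool<T>* ObjectPool<T>::GetInst()
{
    return &g_pool_instance<T>;  // no atomic load, no __cxa_guard_* slow path
}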
This is explained extensively by Jason Turner here: https://www.youtube.com/watch?v=B3WWsKFePiM
I put together some sample code to illustrate the video. Since thread safety is the main issue, I tried to call the method from multiple threads to see its effects.
You can think of the compiler as implementing double-checked locking for you, even though it is free to do whatever it wants to ensure thread safety. But it will at least add a branch to distinguish first-time initialization, unless the optimizer performs the initialization eagerly at global scope.
https://en.wikipedia.org/wiki/Double-checked_locking#Usage_in_C++11
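To make the pattern concrete, here is a hand-written C++ rendering of roughly what the compiler emits for a guarded local static (illustration only: the real ABI uses __cxa_guard_acquire/__cxa_guard_release rather than a std::mutex, the function name is invented, and this sketch never destroys the object):
#include <atomic>
#include <mutex>
#include <new>
#include <string>

const std::string& name_dcl() {
    static std::atomic<bool> initialized{false};   // plays the role of the guard byte
    static std::mutex guard;                       // stands in for __cxa_guard_acquire/release
    alignas(std::string) static unsigned char storage[sizeof(std::string)];

    if (!initialized.load(std::memory_order_acquire)) {       // fast path: one atomic load
        std::lock_guard<std::mutex> lock(guard);              // slow path: taken once
        if (!initialized.load(std::memory_order_relaxed)) {   // re-check under the lock
            new (storage) std::string("name");                // construct in place
            initialized.store(true, std::memory_order_release);
        }
    }
    return *reinterpret_cast<std::string*>(storage);
}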
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include <thread>

struct Temp
{
    // Every time this method is called, the compiler has to check whether `name`
    // is constructed or not, due to the init-at-first-use idiom. This at least
    // involves an atomic load operation and maybe a lock acquisition.
    static const std::string& name() {
        static const std::string name = "name";
        return name;
    }

    // The following does not create contention. The profiler showed a bit of
    // performance improvement.
    const std::string& ref_name = name();
    const std::string& get_name_ref() const {
        return ref_name;
    }
};

int main(int, char**)
{
    Temp tmp;
    constexpr int num_worker = 8;
    std::vector<std::thread> threads;
    for (int i = 0; i < num_worker; ++i) {
        threads.emplace_back([&](){
            for (int i = 0; i < 10000000; ++i) {
                // name() is almost 5s slower
                printf("%zu\n", tmp.get_name_ref().size());
            }
        });
    }
    for (int i = 0; i < num_worker; ++i) {
        threads[i].join();
    }
    return 0;
}
The name() version is 5s slower than get_name_ref() on my machine.
$ time ./test > /dev/null
I also used Compiler Explorer to see what gcc generates. The following shows the double-checked locking pattern; pay attention to the atomic loads and the guard acquisition.
name ()
{
bool retval.0;
bool retval.1;
bool D.25443;
struct allocator D.25437;
const struct string & D.29013;
static const struct string name;
_1 = __atomic_load_1 (&_ZGVZL4namevE4name, 2);
retval.0 = _1 == 0;
if (retval.0 != 0) goto <D.29003>; else goto <D.29004>;
<D.29003>:
_2 = __cxa_guard_acquire (&_ZGVZL4namevE4name);
retval.1 = _2 != 0;
if (retval.1 != 0) goto <D.29006>; else goto <D.29007>;
<D.29006>:
D.25443 = 0;
try
{
std::allocator<char>::allocator (&D.25437);
try
{
try
{
std::__cxx11::basic_string<char>::basic_string (&name, "name", &D.25437);
D.25443 = 1;
__cxa_guard_release (&_ZGVZL4namevE4name);
__cxa_atexit (__dt_comp , &name, &__dso_handle);
}
finally
{
std::allocator<char>::~allocator (&D.25437);
}
}
finally
{
D.25437 = {CLOBBER};
}
}
catch
{
if (D.25443 != 0) goto <D.29008>; else goto <D.29009>;
<D.29008>:
goto <D.29010>;
<D.29009>:
__cxa_guard_abort (&_ZGVZL4namevE4name);
<D.29010>:
}
goto <D.29011>;
<D.29007>:
<D.29011>:
goto <D.29012>;
<D.29004>:
<D.29012>:
D.29013 = &name;
return D.29013;
}

C++ critical section for getter

I have a simple class with one private member that is accessible via get() and set() in a multithreaded environment (multiple readers, multiple writers). How do I lock Get() when it only has a return statement?
class MyValue
{
private:
    System::CriticalSection lock;
    int val { 0 };
public:
    void SetValue(int arg)
    {
        lock.Enter();
        val = arg;
        lock.Leave();
    }
    int GetValue()
    {
        lock.Enter();
        return val;
        // Where should I do lock.Leave()?
    }
};
Don't lock anything. In your example, it is enough to make your member a std::atomic<int>.
You do not need anything else here. As a matter of fact, on Intel architectures (with their strong memory ordering model), this std::atomic is not even likely to cause any performance issues.
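A minimal sketch of that suggestion (using the default sequentially consistent loads and stores; more relaxed orderings are possible but not needed for correctness here):
#include <atomic>

class MyValue
{
    std::atomic<int> val { 0 };
public:
    void SetValue(int arg) { val = arg; }   // atomic store
    int GetValue() const { return val; }    // atomic load
};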
I'm not a multithreading expert, but I think the following should work:
int GetValue()
{
    lock.Enter();
    int ret = val;
    lock.Leave();
    return ret;
}
This is a demonstration of the synchronization object from hauron's answer -- I wanted to show that object construction and destruction overhead simply does not exist in an optimized build.
In the code below, CCsGrabber is an RAII-like class which enters a critical section (wrapped by a CCritical object) when constructed, then leaves it when destroyed:
#include <windows.h>

class CCsGrabber {
    class CCritical& m_Cs;
    CCsGrabber();
public:
    CCsGrabber(CCritical& cs);
    CCsGrabber(CCritical *pcs);
    ~CCsGrabber();
};

class CCritical {
    CRITICAL_SECTION cs;
public:
    CCritical() {
        InitializeCriticalSection(&cs);
    }
    ~CCritical() { DeleteCriticalSection(&cs); }
    void Enter() { EnterCriticalSection(&cs); }
    void Leave() { LeaveCriticalSection(&cs); }
    void Lock() { Enter(); }
    void Unlock() { Leave(); }
};

inline CCsGrabber::CCsGrabber(CCritical& cs) : m_Cs(cs) { m_Cs.Enter(); }
inline CCsGrabber::CCsGrabber(CCritical *pcs) : m_Cs(*pcs) { m_Cs.Enter(); }
inline CCsGrabber::~CCsGrabber() { m_Cs.Leave(); }
Now, a global CCritical object is created (cs), which is used in SerialFunc(), along with a local CCsGrabber instance (csg) to take care of locking and unlocking:
#include <iostream>

CCritical cs;
DWORD last_tick = 0;

void SerialFunc() {
    CCsGrabber csg(cs);
    last_tick = GetTickCount();
}

int main() {
    SerialFunc();
    std::cout << last_tick << std::endl;
}
And below is the disassembly of main() from an optimized 32-bit build. (I apologize for pasting in the whole thing -- I wanted to show that I wasn't hiding anything.)
int main() {
00401C80 push ebp
00401C81 mov ebp,esp
00401C83 and esp,0FFFFFFF8h
00401C86 push 0FFFFFFFFh
00401C88 push 41B038h
00401C8D mov eax,dword ptr fs:[00000000h]
00401C93 push eax
00401C94 mov dword ptr fs:[0],esp
00401C9B sub esp,0Ch
00401C9E push esi
00401C9F push edi
SerialFunc();
00401CA0 push 427B78h ; pointer to CS object
00401CA5 call dword ptr ds:[41C00Ch] ; _RtlEnterCriticalSection@4:
00401CAB call dword ptr ds:[41C000h] ; _GetTickCountStub@0:
00401CB1 push 427B78h ; pointer to CS object
00401CB6 mov dword ptr ds:[00427B74h],eax ; return value => last_tick
00401CBB call dword ptr ds:[41C008h] ; _RtlLeaveCriticalSection@4:
std::cout << last_tick << std::endl;
00401CC1 push ecx
00401CC2 call std::basic_ostream<char,std::char_traits<char> >::operator<< (0401D90h)
00401CC7 mov esi,eax
00401CC9 lea eax,[esp+0Ch]
00401CCD push eax
00401CCE mov ecx,dword ptr [esi]
00401CD0 mov ecx,dword ptr [ecx+4]
00401CD3 add ecx,esi
00401CD5 call std::ios_base::getloc (0401BD0h)
00401CDA push eax
00401CDB mov dword ptr [esp+20h],0
00401CE3 call std::use_facet<std::ctype<char> > (0403E40h)
00401CE8 mov dword ptr [esp+20h],0FFFFFFFFh
00401CF0 add esp,4
00401CF3 mov ecx,dword ptr [esp+0Ch]
00401CF7 mov edi,eax
00401CF9 test ecx,ecx
00401CFB je main+8Eh (0401D0Eh)
00401CFD mov edx,dword ptr [ecx]
00401CFF call dword ptr [edx+8]
00401D02 test eax,eax
00401D04 je main+8Eh (0401D0Eh)
00401D06 mov edx,dword ptr [eax]
00401D08 mov ecx,eax
00401D0A push 1
00401D0C call dword ptr [edx]
00401D0E mov eax,dword ptr [edi]
00401D10 mov ecx,edi
00401D12 push 0Ah
00401D14 mov eax,dword ptr [eax+20h]
00401D17 call eax
00401D19 movzx eax,al
00401D1C mov ecx,esi
00401D1E push eax
00401D1F call std::basic_ostream<char,std::char_traits<char> >::put (0404220h)
00401D24 mov ecx,esi
00401D26 call std::basic_ostream<char,std::char_traits<char> >::flush (0402EB0h)
}
00401D2B mov ecx,dword ptr [esp+14h]
00401D2F xor eax,eax
00401D31 pop edi
00401D32 mov dword ptr fs:[0],ecx
00401D39 pop esi
00401D3A mov esp,ebp
00401D3C pop ebp
00401D3D ret
So we can see that SerialFunc() was inlined directly into main, after the prologue at the beginning and before the cout code -- and nowhere to be found is any superfluous object creation, memory allocation or anything. It just looks like the minimum amount of assembly code required to enter the critical section, store the tick count in a variable, and then leave the critical section.
Then I changed SerialFunc() to:
void SerialFunc() {
    cs.Enter();
    last_tick = GetTickCount();
    cs.Leave();
}
With explicitly-placed cs.Enter() and cs.Leave(), just to compare with the RAII version. The generated code turned out to be identical:
int main() {
00401C80 push ebp
00401C81 mov ebp,esp
00401C83 and esp,0FFFFFFF8h
00401C86 push 0FFFFFFFFh
00401C88 push 41B038h
00401C8D mov eax,dword ptr fs:[00000000h]
00401C93 push eax
00401C94 mov dword ptr fs:[0],esp
00401C9B sub esp,0Ch
00401C9E push esi
00401C9F push edi
SerialFunc();
00401CA0 push 427B78h
00401CA5 call dword ptr ds:[41C00Ch]
00401CAB call dword ptr ds:[41C000h]
00401CB1 push 427B78h
00401CB6 mov dword ptr ds:[00427B74h],eax
00401CBB call dword ptr ds:[41C008h]
std::cout << last_tick << std::endl;
00401CC1 push ecx
00401CC2 call std::basic_ostream<char,std::char_traits<char> >::operator<< (0401D90h)
...
In my opinion, SergeyA's answer is best for the given situation -- a critical section for synchronizing reads and writes from/to 32-bit variables is excessive. However, if something comes up which calls for a critical section or mutex, using an RAII-like object to simplify your code is probably not going to incur significant (or even any) object creation overhead.
(I used Visual C++ 2013 to compile the code above)
Consider using a wrapper class that locks in its constructor and unlocks in its destructor. See the standard implementation: http://en.cppreference.com/w/cpp/thread/unique_lock
This way you don't need to remember to unlock in the case of complex code or exceptions thrown within your code that alter the normal flow of execution.
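A minimal sketch of that approach, with std::mutex and std::lock_guard standing in for the System::CriticalSection from the question:
#include <mutex>

class MyValue
{
private:
    mutable std::mutex lock;
    int val { 0 };
public:
    void SetValue(int arg)
    {
        std::lock_guard<std::mutex> g(lock);
        val = arg;
    }
    int GetValue() const
    {
        std::lock_guard<std::mutex> g(lock);
        return val;   // g unlocks in its destructor, after val has been read
    }
};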

Local static pointer variable is not thread safe?

In my server module, the log4cxx library sometimes made it crash.
It's because ...
LevelPtr Level::getTrace() {
    static LevelPtr level(new Level(Level::TRACE_INT, LOG4CXX_STR("TRACE"), 7));
    return level;
}
The static LevelPtr is returned as a null pointer.
I tested the following code.
#include <iostream>
#include <boost/thread.hpp>

int start_flag = 0;

class test_dummy {
public:
    int mi;

    test_dummy() : mi(1)
    {
        std::cout << "hey!\n";
    }

    static test_dummy* get_p()
    {
        static test_dummy* _p = new test_dummy();
        return _p;
    }
};

void thread_proc()
{
    int i = 0;
    while (start_flag == 0)
    {
        i++;
    }
    if (test_dummy::get_p() == 0)
    {
        std::cout << "error!!!\n";
    }
    else
    {
        std::cout << "mi:" << test_dummy::get_p()->mi << "\n";
    }
}

int main()
{
    boost::thread *pth_array[5] = {0,};
    for (int i = 0; i < 5; i++)
    {
        pth_array[i] = new boost::thread(thread_proc);
    }
    start_flag = 1;
    for (int i = 0; i < 5; i++)
    {
        pth_array[i]->join();
    }
    std::cin.ignore();
}
It's really thread-unsafe, but I'm curious why get_p() returns a null pointer rather than some other allocated address.
Is it because the value was set to 0 while another thread was in the middle of the new operation?
You have a race condition in the code that the compiler generates:
if (!level_initialized)
{
    level_initialized = 1;
    level = new Level(...);
}
return level;
(It doesn't look EXACTLY like that - it's more complex, but I think you get the general idea)
In clang++ 3.5, it seems like there are locks to prevent this sort of race, but without actually looking at the code generated by your compiler, it's impossible to say exactly what is going on. But I suspect that this is what happens.
Here's what clang++ 3.5 generates (minus some clutter)
_Z8getTracev: # @_Z8getTracev
pushq %rbp
movq %rsp, %rbp
subq $32, %rsp
cmpb $0, _ZGVZ8getTracevE5level # Guard variable
jne .LBB0_4
leaq _ZGVZ8getTracevE5level, %rdi
callq __cxa_guard_acquire
cmpl $0, %eax
je .LBB0_4
.Ltmp0:
movl $4, %eax
movl %eax, %edi
callq _Znwm # new
.Ltmp1:
movq %rax, -24(%rbp) # 8-byte Spill
jmp .LBB0_3
.LBB0_3: # %invoke.cont
leaq _ZGVZ8getTracevE5level, %rdi
movq -24(%rbp), %rax # 8-byte Reload
movq -24(%rbp), %rcx # 8-byte Reload
movl $0, (%rcx)
movq %rax, _ZZ8getTracevE5level
callq __cxa_guard_release
.LBB0_4: # %init.end
movq _ZZ8getTracevE5level, %rax
addq $32, %rsp
popq %rbp
retq
I modified the code to use Level as an int, etc., so it's simpler than what you'd get from exactly the code you posted.
It's hard to say much, since the code clearly has undefined behavior, but the standard does require _p to be initialized to a null pointer before get_p is called. And in order to ensure that the local static is initialized exactly once, the compiler more or less has to add an extra flag; something like:
static test_dummy* _p = nullptr;
static bool isInitialized = false;
if ( !isInitialized ) {
    _p = new test_dummy();
    isInitialized = true;
}
(In fact, of course, the initializations shown above are the zero initialization, which occurs before anything else. And a clever compiler could realize that the explicit initialization which occurs the first time through cannot result in _p being a null pointer, and use _p as the control variable.)
The above isn't thread safe; in order to make it thread safe, the entire sequence must be protected. (There are also more or less complicated tricks to avoid the need for a full mutex, but in all cases, all accesses to isInitialized must be atomic.) If the sequence isn't protected, the order in which another thread sees the writes isn't defined. So some thread may see isInitialized as true, but still see the null pointer in _p.
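For completeness, one portable way to make the initialization explicit and exactly-once is std::call_once (a sketch reusing the test_dummy class from the question; get_p_safe, g_p and g_once are names invented here):
#include <mutex>

static test_dummy* g_p = nullptr;
static std::once_flag g_once;

test_dummy* get_p_safe()
{
    // The lambda runs exactly once, even under concurrent first calls;
    // every other thread blocks until the initialization has finished.
    std::call_once(g_once, []{ g_p = new test_dummy(); });
    return g_p;
}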

Unexpected flow of control (compiler-bug?) using errno as argument for exception in C++ (g++)

When using C++ exceptions to transport errno state, the compiled code that gets generated by g++ (4.5.3) for code such as the following
#include <cerrno>
#include <stdexcept>
#include <string>

class oserror : public std::runtime_error {
private:
    static std::string errnotostr(int errno_);
public:
    explicit oserror(int errno_) :
        std::runtime_error(errnotostr(errno_)) {
    }
};

void test() {
    throw oserror(errno);
}
is, rather unexpectedly (on Linux, x86_64):
.type _Z4testv, @function
...
movl $16, %edi
call __cxa_allocate_exception
movq %rax, %rbx
movq %rbx, %r12
call __errno_location
movl (%rax), %eax
movl %eax, %esi
movq %r12, %rdi
call _ZN7oserrorC1Ei
What this basically means is that errno as an argument to a C++ exception is pretty much useless due to the call to __cxa_allocate_exception preceding the call to __errno_location (which is the macro content of errno), where the former calls std::malloc and does not save errno state (at least as far as I understood the sources of __cxa_allocate_exception in eh_alloc.cc of libstdc++).
This means that in the case that memory allocation fails, the error number that was actually to be passed into the exception object gets overwritten with the error number that std::malloc set up. std::malloc gives no guarantee that it preserves errno state anyway, even on successful return - so the above code is definitely broken in the general case.
On Cygwin, x86, the code that gets compiled (also using g++ 4.5.3) for test() is okay, though:
.def __Z4testv; .scl 2; .type 32; .endef
...
call ___errno
movl (%eax), %esi
movl $8, (%esp)
call ___cxa_allocate_exception
movl %eax, %ebx
movl %ebx, %eax
movl %esi, 4(%esp)
movl %eax, (%esp)
call __ZN7oserrorC1Ei
Does this mean that for library code to properly wrap errno state in an exception, I'll always have to use a macro which expands to something like
int curerrno_ = errno;
throw oserror(curerrno_);
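A sketch of such a macro (THROW_OSERROR is a name invented here; the do/while(0) wrapper just makes the expansion behave like a single statement):
#define THROW_OSERROR()            \
    do {                           \
        int curerrno_ = errno;     \
        throw oserror(curerrno_);  \
    } while (0)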
I actually can't seem to find the corresponding section of the C++ standard which says anything about evaluation order in the case of exceptions, but to me it seems that the g++ generated code on x86_64 (on Linux) is broken due to allocating memory for the exception object before collecting the parameters for its constructor, and that this is a compiler bug in some way. Am I right, or is this some fundamentally wrong thinking on my part?
What this basically means is that errno as an argument to a C++ exception is pretty much useless due to the call to __cxa_allocate_exception preceding the call to __errno_location (which is the macro content of errno), where the former calls std::malloc and does not save errno state (at least as far as I understood the sources of __cxa_allocate_exception in eh_alloc.cc of libstdc++).
This is not true. As far as I have checked the source code, the only "thing" inside __cxa_allocate_exception that can change errno is malloc(). Two cases may occur:
malloc() succeeds, then errno is unchanged;
malloc() fails, then std::terminate() is called and your oserror() is never constructed.
Therefore, since calling __cxa_allocate_exception before calling your constructor does not functionally change your program, I believe g++ has the right to do so.
Please note that __cxa_allocate_exception is done before your constructor is actually called.
32:std_errno.cpp **** throw oserror( errno );
352 0007 BF100000 movl $16, %edi
;;;; Exception space allocation:
355 000c E8000000 call __cxa_allocate_exception
356 0011 4889C3 movq %rax, %rbx
;;;; "errno" evaluation:
357 0014 E8000000 call __errno_location
358 0019 8B00 movl (%rax), %eax
359 001b 89C6 movl %eax, %esi
360 001d 4889DF movq %rbx, %rdi
;;;; Constructor called here:
362 0020 E8000000 call _ZN7oserrorC1Ei
So it makes sense: __cxa_allocate_exception just allocates space for an exception, but does not construct it (libc++abi specification).
When your exception object is built, errno has already been evaluated.
I took your example and implemented errnotostr:
// Unexpected flow of control (compiler-bug?) using errno as argument for exception in C++ (g++)
#include <cerrno>
#include <stdexcept>
#include <string>
#include <iostream>
#include <cstring>
#include <sstream>

class oserror : public std::runtime_error
{
private:
    static std::string errnotostr(int errno_)
    {
        std::stringstream ss;
        ss << "[" << errno_ << "] " << std::strerror( errno_ );
        return ss.str( );
    }
public:
    explicit oserror( int errno_ )
        : std::runtime_error( errnotostr( errno_ ) )
    {
    }
};

void test( )
{
    throw oserror( errno );
}

int main( )
{
    try
    {
        std::cout << "Enter a value to errno: ";
        std::cin >> errno;
        std::cout << "Test with errno = " << errno << std::endl;
        test( );
    }
    catch ( oserror &o )
    {
        std::cout << "Exception caught: " << o.what( ) << std::endl;
        return 1;
    }
    return 0;
}
Then I compiled with -O0 and -O2, ran it, and got the same results, all according to expectations:
> ./std_errno
Enter a value to errno: 1
Test with errno = 1
Exception caught: [1] Operation not permitted
> ./std_errno
Enter a value to errno: 11
Test with errno = 11
Exception caught: [11] Resource temporarily unavailable
> ./std_errno
Enter a value to errno: 111
Test with errno = 111
Exception caught: [111] Connection refused
(Running on 64-bits Opensuse 12.1, G++ 4.6.2)