Pimpl + QSharedPointer - Destructor = Disaster - c++

Yesterday I ran into misery which took me 24 hours of frustration. The problem boiled down to unexpected crashes occurring on random basis. To complicate things, debugging reports had absolutely random pattern as well. To complicate it even more, all debugging traces were leading to either random Qt sources or native DLLs, i.e. proving every time that the issue is rather not on my side.
Here you are a few examples of such lovely reports:
Program received signal SIGSEGV, Segmentation fault.
0x0000000077864324 in ntdll!RtlAppendStringToString () from C:\Windows\system32\ntdll.dll
(gdb) bt
#0 0x0000000077864324 in ntdll!RtlAppendStringToString () from C:\Windows\system32\ntdll.dll
#1 0x000000002efc0230 in ?? ()
#2 0x0000000002070005 in ?? ()
#3 0x000000002efc0000 in ?? ()
#4 0x000000007787969f in ntdll!RtlIsValidHandle () from C:\Windows\system32\ntdll.dll
#5 0x0000000000000000 in ?? ()
warning: HEAP: Free Heap block 307e5950 modified at 307e59c0 after it was freed
Program received signal SIGTRAP, Trace/breakpoint trap.
0x00000000778bf0b2 in ntdll!ExpInterlockedPopEntrySListFault16 () from C:\Windows\system32\ntdll.dll
(gdb) bt
#0 0x00000000778bf0b2 in ntdll!ExpInterlockedPopEntrySListFault16 () from C:\Windows\system32\ntdll.dll
#1 0x000000007786fd34 in ntdll!RtlIsValidHandle () from C:\Windows\system32\ntdll.dll
#2 0x0000000077910d20 in ntdll!RtlGetLastNtStatus () from C:\Windows\system32\ntdll.dll
#3 0x00000000307e5950 in ?? ()
#4 0x00000000307e59c0 in ?? ()
#5 0x00000000ffffffff in ?? ()
#6 0x0000000000220f10 in ?? ()
#7 0x0000000077712d60 in WaitForMultipleObjectsEx () from C:\Windows\system32\kernel32.dll
#8 0x0000000000000000 in ?? ()
Program received signal SIGSEGV, Segmentation fault.
0x0000000000a9678a in QBasicAtomicInt::ref (this=0x8) at ../../include/QtCore/../../../qt-src/src/corelib/arch/qatomic_x86_64.h:121
121 : "memory");
(gdb) bt
#0 0x0000000000a9678a in QBasicAtomicInt::ref (this=0x8) at ../../include/QtCore/../../../qt-src/src/corelib/arch/qatomic_x86_64.h:121
#1 0x00000000009df08e in QVariant::QVariant (this=0x21e4d0, p=...) at d:/Distributions/qt-src/src/corelib/kernel/qvariant.cpp:1426
#2 0x0000000000b4dde9 in QList<QVariant>::value (this=0x323bd480, i=1) at ../../include/QtCore/../../../qt-src/src/corelib/tools/qlist.h:666
#3 0x00000000009ccff7 in QObject::property (this=0x3067e900,
name=0xa9d042a <QCDEStyle::drawPrimitive(QStyle::PrimitiveElement, QStyleOption const*, QPainter*, QWidget const*) const::pts5+650> "_q_stylerect")
at d:/Distributions/qt-src/src/corelib/kernel/qobject.cpp:3742
#4 0x0000000000000000 in ?? ()
As you can see this stuff is pretty nasty, it gives one no useful information. But, there was one thing I didn't pay attention to. It was a weird warning during compilation which is also hard to catch with an eye:
In file included from d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer.h:50:0,
from d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/QSharedPointer:1,
from ../../../../source/libraries/Project/sources/Method.hpp:4,
from ../../../../source/libraries/Project/sources/Slot.hpp:4,
from ../../../../source/libraries/Project/sources/Slot.cpp:1:
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h: In instantiation of 'static void QtSharedPointer::ExternalRefCount<T>::deref(QtSharedPointer::ExternalRefCount<T>::Data*, T*) [with T = Project::Method::Private; QtSharedPointer::ExternalRefCount<T>::Data = QtSharedPointer::ExternalRefCountData]':
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:336:11: required from 'void QtSharedPointer::ExternalRefCount<T>::deref() [with T = Project::Method::Private]'
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:401:38: required from 'QtSharedPointer::ExternalRefCount<T>::~ExternalRefCount() [with T = Project::Method::Private]'
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:466:7: required from here
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:342:21: warning: possible problem detected in invocation of delete operator: [enabled by default]
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:337:28: warning: 'value' has incomplete type [enabled by default]
Actually, I turned to this warning only as a last resort because in such a desperate pursuit to find a bug, the code was already infected with logging to death literally.
After reading it carefully, I recalled that, for instance, if one uses std::unique_ptr or std::scoped_ptr for Pimpl - one should certainly provide desctructor, otherwise the code won't even compile. However, I also remember that std::shared_ptr does not care about destructor and works fine without it. It was another reason why I didn't pay attention to this strange warning. Long story short, when I added destructor, this random crashing stopped. Looks like Qt's QSharedPointer has some design flaws compared to std::shared_ptr. I guess it would be better, if Qt developers transformed this warning into error because debugging marathons like that are simply not worth one's time, effort and nerves.
My questions are:
What's wrong with QSharedPointer? Why destructor is so vital?
Why crashing happened when there was no destructor? These objects (which are using Pimpl + QSharedPointer) are created on stack and no other objects have access to them after their death. However, crashing happened during some random period of time after their death.
Has anyone ran into issues like that before? Please, share your experience.
Are there other pitfalls
like that in Qt - ones that I must know about for sure to stay
safe in future?
Hopefully, these questions and my post in general will help others to avoid the hell I've been to for the past 24 hours.

The issue has been worked around in Qt 5, see https://codereview.qt-project.org/#change,26974
The compiler calling the wrong destructor or assuming a different memory layout probably lead to some kind of memory corruption. I'd say a compiler should give an error for this issue and not a warning.

You'll run into a similar issue with std::unique_ptr, which can also cause broken destructors if used with an incomplete type. The fix is pretty trivial, of course - I declare a constructor for the class, then define it in the implementation file as
MyClass::~MyClass() = default;
The reason that this is an issue for std::unique_ptr but not std::shared_ptr is that the destructor is part of the type of the former, but is a member of the latter.

Related

gdb program core: don't get the specific bug

I use gdb command as follows to localize the segmentation fault, but it shows ?? so that I am confused. What does it mean? How to avoid it?
$ gdb program core
...
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000048d0000048c in ?? ()
(gdb) bt
#0 0x0000046a00000469 in ?? ()
#1 0x0000046c0000046b in ?? ()
#2 0x0000046e0000046d in ?? ()
#3 0x000004700000046f in ?? ()
#4 0x0000047300000472 in ?? ()
#5 0x0000047600000475 in ?? ()
#6 0x0000047800000477 in ?? ()
#7 0x0000047a00000479 in ?? ()
#8 0x0000047d0000047b in ?? ()
...
I find that the array is out of bounds and I solved it. But I still confused with the phenomenon above.
0x0000048d0000048c
This looks like you've called a function through a function pointer, but that pointer has been overwritten with two integers: 0x48d == 1165 and 0x48c == 1164 (do these values look like something that your program is using?).
You should use bt to tell you how you got there.
You should probably use Valgrind or Address Sanitizer to check for uninitialized or dangling memory and buffer overflow (which are some of the common ways to end up with invalid function pointer).
Update:
Now that you show the stack trace, it's an almost 100% guarantee that you have some local array of integers which you've overflown (filling it with values like 1129, 1130, 1131, etc.), thus corrupting your stack.
Address Sanitizer (available in recent versions of GCC) should point you straight at where the bug is.
This means that your program crashed in a function unknow by gdb (function not provided by the symbol table)
try these two options, in the given order:
if you are debugging a target, be sure that all your code layers are compiled with the option -g if you are using gcc.
You can give manually the symbol table to gdb with the command file "binary_with_symbol_table" and it will give you the function and the address of the bug.
Note that many exceptions may be hidden behind a segmentation fault.

What are the possible reasons for POSIX SIGBUS?

My program recently crashed with the following stack;
Program terminated with signal 7, Bus error.
#0 0x00007f0f323beb55 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f0f323beb55 in raise () from /lib64/libc.so.6
#1 0x00007f0f35f8042e in skgesigOSCrash () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#2 0x00007f0f36222ca9 in kpeDbgSignalHandler () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#3 0x00007f0f35f8063e in skgesig_sigactionHandler () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#4 <signal handler called>
What should I check in my code to avoid this? Or is this something Oracle should fix?
Main reasons you could get a bus error revolves around inaccessible memory. This could be due to many reasons:
Accessing through a deleted pointer.
Accessing through an uninitialized pointer.
Accessing through a NULL pointer.
Accessing the address which is not yours. It could be due to overflow errors.
Try adding the following to the $ORACLE_HOME/network/admin/*.ora file:
DIAG_ADR_ENABLED=OFF
DIAG_SIGHANDLER_ENABLED=FALSE
DIAG_DDE_ENABLED=FALSE
This sounds like an Oracle issue.
And also Oracle's libraries seem to be compiled by Intel compilers.

double free or corruption (!prev) while calling __do_global_dtors_aux

I'm getting this error message after my app has done everything right
/lib64/libc.so.6[0x3f1ee70d7f]
/lib64/libc.so.6(cfree+0x4b)[0x3f1ee711db]
/home/user/workspace/NewProject/build/bin/TestApp(_ZN9__gnu_cxx13new_allocatorIN5boost10shared_ptrINS1_5uuids4uuidEEEE10deallocateEPS5_m+0x20)[0x49c174]
/home/user/workspace/NewProject/build/bin/TestApp(_ZNSt12_Vector_baseIN5boost10shared_ptrINS0_5uuids4uuidEEESaIS4_EE13_M_deallocateEPS4_m+0x32)[0x495b84]
/home/user/workspace/NewProject/build/bin/TestApp(_ZNSt12_Vector_baseIN5boost10shared_ptrINS0_5uuids4uuidEEESaIS4_EED2Ev+0x47)[0x49598b]
/home/user/workspace/NewProject/build/bin/TestApp(_ZNSt6vectorIN5boost10shared_ptrINS0_5uuids4uuidEEESaIS4_EED1Ev+0x65)[0x48bf27]
/lib64/libc.so.6(__cxa_finalize+0x8e)[0x3f1ee337fe]
/home/user/workspace/NewProject/build/components/lib_path/libhelper-d.so[0x2aaaab052b36]
If I run the program in gdb I can get the following backtrace, but it is all I get:
#0 0x0000003f1ee30285 in raise () from /lib64/libc.so.6
#1 0x0000003f1ee31d30 in abort () from /lib64/libc.so.6
#2 0x0000003f1ee692bb in __libc_message () from /lib64/libc.so.6
#3 0x0000003f1ee70d7f in _int_free () from /lib64/libc.so.6
#4 0x0000003f1ee711db in free () from /lib64/libc.so.6
#5 0x000000000049c174 in __gnu_cxx::new_allocator<boost::shared_ptr<boost::uuids::uuid> >::deallocate (this=0x2aaaab2cea50, __p=0x1cfd8d0)
at /opt/local/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.5/../../../../include/c++/4.4.5/ext/new_allocator.h:95
#6 0x0000000000495b84 in std::_Vector_base<boost::shared_ptr<boost::uuids::uuid>, std::allocator<boost::shared_ptr<boost::uuids::uuid> > >::_M_deallocate (
this=0x2aaaab2cea50, __p=0x1cfd8d0, __n=8) at /opt/local/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.5/../../../../include/c++/4.4.5/bits/stl_vector.h:146
#7 0x000000000049598b in std::_Vector_base<boost::shared_ptr<boost::uuids::uuid>, std::allocator<boost::shared_ptr<boost::uuids::uuid> > >::~_Vector_base (
this=0x2aaaab2cea50, __in_chrg=<value optimized out>)
at /opt/local/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.5/../../../../include/c++/4.4.5/bits/stl_vector.h:132
#8 0x000000000048bf27 in std::vector<boost::shared_ptr<boost::uuids::uuid>, std::allocator<boost::shared_ptr<boost::uuids::uuid> > >::~vector (this=0x2aaaab2cea50,
__in_chrg=<value optimized out>) at /opt/local/bin/../lib/gcc/x86_64-unknown-linux-gnu/4.4.5/../../../../include/c++/4.4.5/bits/stl_vector.h:313
#9 0x0000003f1ee337fe in __cxa_finalize () from /lib64/libc.so.6
#10 0x00002aaaab052b36 in __do_global_dtors_aux ()
from /home/user/workspace/NewProject/build/components/lib_path/libhelper-d.so
#11 0x0000000000000000 in ?? ()
I really have no idea of how to proceed from here.
UPDATE I forgot to mention that the only global variable of the type which appears in the error is cleared m_uuids.size() == 0 by the time the error appear.
I had this same problem using glog. In my case, it was this scenario:
I had a share library, call it 'common.so' that linked glog.
My main executable, call it 'app' also linked glog, and linked in common.so.
The problem I had was that glog was linked statically in both the .so and the exectuable. When I changed both #1 and #2 to link the .so instead of the .a, the problem went away.
Not sure this is your problem, but it could be. Generally speaking, corruption when freeing up memory often means that you corrupted the memory pool (such as deleting the same pointer twice). I believe linking in the .a in both cases, I was getting cleanup behavior on the same global pointer (an std::string in my case) twice.
Update:
After much investigation, this is very likely the problem. What happens is that each the executable and the .so have a global variable of std::string type (part of glog). These std::string global variables must be constructed when the object (exe, .so) is loaded by the dynamic linker/loader. Also, a destructor for each is added for cleanup using at_exit. However, when it comes time for at_exit functions to be called, both global reference point to the same std::string. That means the std::string destructor is called twice, but on the same object. Then free is called on the same memory location twice. Global std::string (or any class with a constructor) are a bad idea. If you choose to have a .so based architecture (a good idea), you have to be careful with all 3rd party libraries and how they handle globals. You stay out of most danger by linking to the .so for all 3rd party libraries.
Where the error is appearing is probably a little misleading. My best guess would be that you've got a vector of shared pointers and as it's being destroyed, one (at least) of those shared pointers is trying to delete the object that it's pointing to, only to find that it has already been deleted.
Are you mixing raw pointers with shared pointers anywhere? If so, you might find a perfectly innocuous looking delete somewhere which is pulling the rug from under the feet of your shared_ptr

Get any line number of std:logic_error

My C++ program exits with a std::logic_error and I'd like to track down the source line that caused it. How can I do that?
TBH, I'm using gdb, using g++ -g in order to add debug info. All I can get are these messages:
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid
Catchpoint 1 (exception thrown), 0x0045ffa0 in __cxa_throw ()
(gdb) bt
#0 0x0045ffa0 in __cxa_throw ()
#1 0x004601e8 in std::__throw_logic_error(char const*) ()
#2 0x00502238 in typeinfo for std::__timepunct<wchar_t> ()
#3 0x004685f8 in std::runtime_error::what() const ()
#4 0x03210da8 in ?? ()
#5 0x002efbcc in ?? ()
#6 0x00468734 in std::domain_error::~domain_error() ()
#7 0x00000000 in ?? ()
(gdb)
You use a debugger.
Using debugger tools is a very important skill to learn with compiled languages like C and C++.
The exception objects don't carry any source information with them. However, they hopefully contain a useful message accessible using the what() member. Other than that you'd either have to use a debugger allowing to break when exceptions are thrown or set a break point into the constructor of std::logic_error. As long as exceptions are exceptional this works OK. It doesn't work too well with code throwing exceptions in non-exceptional cases.

What caused the mysterious duplicate entry in my stack?

I'm investigating a deadlock bug. I took a core with gcore, and found that one of my functions seems to have called itself - even though it does not make a recursive function call.
Here's a fragment of the stack from gdb:
Thread 18 (Thread 4035926944 (LWP 23449)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x005133de in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2 0x00510017 in _L_mutex_lock_182 () from /lib/tls/libpthread.so.0
#3 0x080d653c in ?? ()
#4 0xf7c59480 in ?? () from LIBFOO.so
#5 0x081944c0 in ?? ()
#6 0x081944b0 in ?? ()
#7 0xf08f3b38 in ?? ()
#8 0xf7c3b34c in FOO::Service::releaseObject ()
from LIBFOO.so
#9 0xf7c3b34c in FOO::Service::releaseObject ()
from LIBFOO.so
#10 0xf7c36006 in FOO::RequesterImpl::releaseObject ()
from LIBFOO.so
#11 0xf7e2afbf in BAR::BAZ::unsubscribe (this=0x80d0070, sSymbol=#0xf6ded018)
at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:176
...more stack
I've elided some of the names: FOO & BAR are namespaces.BAZ is a class.
The interesting part is #8 and #9, the call to Service::releaseObject(). This function does not call itself, nor does it call any function that calls it back... it is not recursive. Why then does it appear in the stack twice?
Is this an artefact created by the debugger, or could it be real?
You'll notice that the innermost call is waiting for a mutex - I think this could be my deadlock. Service::releaseObject() locks a mutex, so if it magically teleported back inside itself, then a deadlock most certainly could occur.
Some background:
This is compiled using g++ v3.4.6 on RHEL4. It's a 64-bit OS, but this is 32-bit code, compiled with -m32. It's optimised at -O3. I can't guarantee that the application code was compiled with exactly the same options as the LIBFOO code.
Class Service has no virtual functions, so there's no vtable. Class RequesterImpl inherits from a fully-virtual interface, so it does have a vtable.
Stacktraces are unreliable on x86 at any optimization level: -O1 and higher enable -fomit-frame-pointer.
The reason you get "bad" stack is that __lll_mutex_lock_wait has incorrect unwind descriptor (it is written in hand-coded assembly). I believe this was fixed somewhat recently (in 2008), but can't find the exact patch.
Once the GDB stack unwinder goes "off balance", it creates bogus frames (#2 through #8), but eventually stumbles on a frame which uses frame pointer, and produces correct stack trace for the rest of the stack.