I'm investigating a deadlock bug. I took a core with gcore, and found that one of my functions seems to have called itself - even though it does not make a recursive function call.
Here's a fragment of the stack from gdb:
Thread 18 (Thread 4035926944 (LWP 23449)):
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x005133de in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#2 0x00510017 in _L_mutex_lock_182 () from /lib/tls/libpthread.so.0
#3 0x080d653c in ?? ()
#4 0xf7c59480 in ?? () from LIBFOO.so
#5 0x081944c0 in ?? ()
#6 0x081944b0 in ?? ()
#7 0xf08f3b38 in ?? ()
#8 0xf7c3b34c in FOO::Service::releaseObject ()
from LIBFOO.so
#9 0xf7c3b34c in FOO::Service::releaseObject ()
from LIBFOO.so
#10 0xf7c36006 in FOO::RequesterImpl::releaseObject ()
from LIBFOO.so
#11 0xf7e2afbf in BAR::BAZ::unsubscribe (this=0x80d0070, sSymbol=#0xf6ded018)
at /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../include/c++/3.4.6/bits/stl_tree.h:176
...more stack
I've elided some of the names: FOO and BAR are namespaces; BAZ is a class.
The interesting part is #8 and #9, the call to Service::releaseObject(). This function does not call itself, nor does it call any function that calls it back... it is not recursive. Why then does it appear in the stack twice?
Is this an artefact created by the debugger, or could it be real?
You'll notice that the innermost call is waiting for a mutex - I think this could be my deadlock. Service::releaseObject() locks a mutex, so if it magically teleported back inside itself, then a deadlock most certainly could occur.
Some background:
This is compiled using g++ v3.4.6 on RHEL4. It's a 64-bit OS, but this is 32-bit code, compiled with -m32. It's optimised at -O3. I can't guarantee that the application code was compiled with exactly the same options as the LIBFOO code.
Class Service has no virtual functions, so there's no vtable. Class RequesterImpl inherits from a fully-virtual interface, so it does have a vtable.
Stack traces are unreliable on x86 at any optimization level above -O0: -O1 and higher enable -fomit-frame-pointer.
The reason you get a "bad" stack is that __lll_mutex_lock_wait has an incorrect unwind descriptor (it is written in hand-coded assembly). I believe this was fixed somewhat recently (in 2008), but I can't find the exact patch.
Once the GDB stack unwinder goes "off balance", it creates bogus frames (#2 through #8), but it eventually stumbles on a frame that does use a frame pointer, and produces a correct stack trace for the rest of the stack.
Related
I am writing a program that uses both OS threads and user threads (fibers; I wrote this user-threading library myself, with the context switching done in assembly language). The problem is that the program sometimes ends with a segmentation fault, but other times it doesn't.
The problem is that a function which shouldn't get called is being called with invalid arguments. I think the gdb backtrace isn't giving me the proper information. Here is the output from gdb:
#0 0x0000000000000000 in ?? ()
#1 0x0000555555555613 in thread_entry (fn=0x0, arg=0x0) at userThread2.cpp:243
#2 0x000055555555c791 in start_thread () at contextSwitch2.s:57
#3 0x0000000000000000 in ?? ()
fn is the function I want to run as a user thread, arg is the argument passed to that function.
I have a function Spawn in my user-threading library that pushes the two arguments (fn and arg) and a pointer to start_thread onto the stack; start_thread, an assembly function, then gets called, and it in turn calls the C++ function thread_entry, which invokes fn with argument arg.
I am not expecting a call to start_thread or thread_entry at the point of the error, so I am not sure how start_thread gets called. Even if it were called, Spawn() should appear on the stack, as it is the only function that calls start_thread. But Spawn is not shown in the gdb backtrace.
Some online posts have mentioned that stack corruption or something similar could produce this kind of result, and they prescribe the use of "record btrace pt". I have spent considerable time setting up Intel btrace pt support in the kernel and gdb, but I was unable to get it working, so I am not going down that route.
Here is a link to my code with compilation instructions:
https://github.com/smartWaqar/userThreading
I set a breakpoint on thread_entry, and observed:
...
[Thread 0x7ffff7477700 (LWP 203995) exited]
parentId: 1
OST 1 Hello A0 on CPU 2
current_thread_num 0 next_thread_num 1
After Thread Exit
After changeOSThread
OST 1 Hello C1 on CPU 2 ---------------
Before changeOSThread
**************** In changeOSThread **************
current_thread_num 1 next_thread_num 2
Thread 3 "a.out" hit Breakpoint 1, thread_entry (fn=0x0, arg=0x0) at userThread2.cpp:243
243 fn(arg) ;
(gdb) bt
#0 thread_entry (fn=0x0, arg=0x0) at userThread2.cpp:243
#1 0x000055555555c181 in start_thread () at context.s:57
#2 0x0000000000000000 in ?? ()
Conclusions:
GDB is giving you the correct crash stack trace.
You do in fact call thread_entry with fn==0, which of course promptly crashes.
There is something racy going on, as this doesn't happen every time.
Even if it gets called then Spawn() should have called start_thread as it is the only function which calls start_thread
I've observed the following "call" to start_thread:
Thread 2 "a.out" hit Breakpoint 1, start_thread () at context.s:53
53 push %rbp
(gdb) bt
#0 start_thread () at context.s:53
#1 0x0000555555555e4f in changeOSThread (parentId=<error reading variable>) at t.cc:196
#2 0x0000000000000000 in ?? ()
So I think your mental model of who calls start_thread is wrong.
This is a bit too much code for me to look at. If you want additional help, please reduce the test case to bare minimum.
I am supporting an application written in C++ over many years, and as of late it has started to crash, producing core dumps that we don't know how to interpret.
It runs on an appliance on Ubuntu 14.04.5
When loading the core file in GDB it says that:
Program terminated with signal SIGABRT, Aborted
I can inspect 230 threads, but they are all in wait(), at the exact same memory address.
There is a thread with ID 1 that in theory could be the responsible one, but that thread is also in wait.
So I have two questions basically.
How does the id index of the threads work?
Is the thread with GDB ID 1 the last active thread, or is that an arbitrary index, meaning the failure could be in any of the other threads?
How can all threads be in wait() when a SIGABRT is triggered?
Shouldn't the instruction pointer be at the failing instruction when the OS stepped in and halted the process? Or is this some sort of deadlock protection?
Any help much appreciated.
Backtrace of thread 1:
#0 0xf771dcd9 in ?? ()
#1 0xf74ad4ca in _int_free (av=0x38663364, p=<optimized out>,have_lock=-186161432) at malloc.c:3989
#2 0xf76b41ab in std::string::_Rep::_M_destroy(std::allocator<char> const&) () from /usr/lib32/libstdc++.so.6
#3 0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#4 0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#5 0x5685e8b4 in SlimStringMapper::~SlimStringMapper() ()
#6 0x567d6bc3 in destroy ()
#7 0x566a40b4 in HttpProxy::getLogonCredentials(HttpClient*, HttpServerTransaction*, std::string const&, std::string const&, std::string&, std::string&) ()
#8 0x566a5d04 in HttpProxy::add_authorization_header(HttpClient*, HttpServerTransaction*, Hosts::Host*) ()
#9 0x566af97c in HttpProxy::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#10 0x566d597e in callOnClientRequest(HttpClient*, HttpServerTransaction*, FastHttpRequest*) ()
#11 0x566d169f in GateKeeper::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#12 0x566a2291 in HttpClientThread::run() ()
#13 0x5682e37c in wa_run_thread ()
#14 0xf76f6f72 in start_thread (arg=0xec65ab40) at pthread_create.c:312
#15 0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#16 0xec65ab40 in ?? ()
Another thread that should be in wait:
#0 0xf771dcd9 in ?? ()
#1 0x5682e37c in wa_run_thread ()
#2 0xf76f6f72 in start_thread (arg=0xf33bdb40) at pthread_create.c:312
#3 0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#4 0xf33bdb40 in ?? ()
Best regards
Jon
How can all threads be in wait() when a SIGABRT is triggered?
Is wait the POSIX function, or something from the run-time environment? Are you looking at a higher-level backtrace?
Anyway, there is an easy explanation why this can happen: SIGABRT was sent to the process, and not generated by a thread in a synchronous fashion. Perhaps a coworker sent the signal to create the coredump, after observing the deadlock, to collect evidence for future analysis?
How does the id index of the threads work? Is thread with GDB ID 1 the last active thread?
When the program is running under GDB, GDB numbers threads as it discovers them, so thread 1 is always the main thread.
But when loading a core dump, GDB discovers threads in the order in which the kernel saved them. The kernels I have seen always save the thread that caused program termination first, so loading a core into GDB usually gets you straight to the crash point without the need to switch threads.
How can all threads be in wait() when a SIGABRT is triggered?
One possibility is that you are not analyzing the core correctly. In particular, you need exact copies of the shared libraries that were in use at the time the core was produced, and that's unlikely to be the case when the application runs on an "appliance" and you are analyzing the core on your development machine. See this answer.
I just saw your question. My answer is not specific to your direct question, but suggests some ways to handle this kind of situation. Multi-threading depends entirely on the machine's hardware and operating system, especially its memory and processors. More threads means more memory is required, as well as more processor time slices. I doubt your appliance has anywhere near enough processors to run 230 threads concurrently at full performance. To avoid this situation, the following steps may help:
Control the creation of threads; limit the number of threads running concurrently.
Increase the memory available to your application (check the compiler options for increasing memory at run time, or have the O/S allocate enough memory).
Set the guard size and stack size of each thread properly. (This has to be calculated from what your application's threads actually do, which is a bit complicated; please read the relevant documentation.)
Handle synchronized blocks properly to avoid deadlocks.
Use condition variables and similar primitives where necessary.
As you said, most of your threads are in a wait state, which means they are waiting for a lock to be released; that in turn means one thread has already acquired the lock and is either still busy processing or is stuck in a deadlock.
Yesterday I ran into a misery that cost me 24 hours of frustration. The problem boiled down to unexpected crashes occurring at random. To complicate things, the debugging reports showed absolutely random patterns as well. To complicate it even more, all the debugging traces led to either random Qt sources or native DLLs, i.e. seeming to prove every time that the issue was not on my side.
Here you are a few examples of such lovely reports:
Program received signal SIGSEGV, Segmentation fault.
0x0000000077864324 in ntdll!RtlAppendStringToString () from C:\Windows\system32\ntdll.dll
(gdb) bt
#0 0x0000000077864324 in ntdll!RtlAppendStringToString () from C:\Windows\system32\ntdll.dll
#1 0x000000002efc0230 in ?? ()
#2 0x0000000002070005 in ?? ()
#3 0x000000002efc0000 in ?? ()
#4 0x000000007787969f in ntdll!RtlIsValidHandle () from C:\Windows\system32\ntdll.dll
#5 0x0000000000000000 in ?? ()
warning: HEAP: Free Heap block 307e5950 modified at 307e59c0 after it was freed
Program received signal SIGTRAP, Trace/breakpoint trap.
0x00000000778bf0b2 in ntdll!ExpInterlockedPopEntrySListFault16 () from C:\Windows\system32\ntdll.dll
(gdb) bt
#0 0x00000000778bf0b2 in ntdll!ExpInterlockedPopEntrySListFault16 () from C:\Windows\system32\ntdll.dll
#1 0x000000007786fd34 in ntdll!RtlIsValidHandle () from C:\Windows\system32\ntdll.dll
#2 0x0000000077910d20 in ntdll!RtlGetLastNtStatus () from C:\Windows\system32\ntdll.dll
#3 0x00000000307e5950 in ?? ()
#4 0x00000000307e59c0 in ?? ()
#5 0x00000000ffffffff in ?? ()
#6 0x0000000000220f10 in ?? ()
#7 0x0000000077712d60 in WaitForMultipleObjectsEx () from C:\Windows\system32\kernel32.dll
#8 0x0000000000000000 in ?? ()
Program received signal SIGSEGV, Segmentation fault.
0x0000000000a9678a in QBasicAtomicInt::ref (this=0x8) at ../../include/QtCore/../../../qt-src/src/corelib/arch/qatomic_x86_64.h:121
121 : "memory");
(gdb) bt
#0 0x0000000000a9678a in QBasicAtomicInt::ref (this=0x8) at ../../include/QtCore/../../../qt-src/src/corelib/arch/qatomic_x86_64.h:121
#1 0x00000000009df08e in QVariant::QVariant (this=0x21e4d0, p=...) at d:/Distributions/qt-src/src/corelib/kernel/qvariant.cpp:1426
#2 0x0000000000b4dde9 in QList<QVariant>::value (this=0x323bd480, i=1) at ../../include/QtCore/../../../qt-src/src/corelib/tools/qlist.h:666
#3 0x00000000009ccff7 in QObject::property (this=0x3067e900,
name=0xa9d042a <QCDEStyle::drawPrimitive(QStyle::PrimitiveElement, QStyleOption const*, QPainter*, QWidget const*) const::pts5+650> "_q_stylerect")
at d:/Distributions/qt-src/src/corelib/kernel/qobject.cpp:3742
#4 0x0000000000000000 in ?? ()
As you can see, this stuff is pretty nasty: it gives you no useful information. But there was one thing I had not paid attention to. It was a weird warning during compilation, which is also easy to miss:
In file included from d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer.h:50:0,
from d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/QSharedPointer:1,
from ../../../../source/libraries/Project/sources/Method.hpp:4,
from ../../../../source/libraries/Project/sources/Slot.hpp:4,
from ../../../../source/libraries/Project/sources/Slot.cpp:1:
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h: In instantiation of 'static void QtSharedPointer::ExternalRefCount<T>::deref(QtSharedPointer::ExternalRefCount<T>::Data*, T*) [with T = Project::Method::Private; QtSharedPointer::ExternalRefCount<T>::Data = QtSharedPointer::ExternalRefCountData]':
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:336:11: required from 'void QtSharedPointer::ExternalRefCount<T>::deref() [with T = Project::Method::Private]'
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:401:38: required from 'QtSharedPointer::ExternalRefCount<T>::~ExternalRefCount() [with T = Project::Method::Private]'
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:466:7: required from here
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:342:21: warning: possible problem detected in invocation of delete operator: [enabled by default]
d:/Libraries/x64/MinGW-w64/4.7.2/Qt/4.8.4/include/QtCore/qsharedpointer_impl.h:337:28: warning: 'value' has incomplete type [enabled by default]
Actually, I turned to this warning only as a last resort, because in such a desperate pursuit of the bug the code had already been literally infested with logging.
After reading it carefully, I recalled that, for instance, if one uses std::unique_ptr or boost::scoped_ptr for Pimpl, one must provide a destructor, otherwise the code won't even compile. However, I also remembered that std::shared_ptr does not care about the destructor and works fine without it. That was another reason I didn't pay attention to this strange warning. Long story short, once I added the destructor, the random crashing stopped. It looks like Qt's QSharedPointer has some design flaws compared to std::shared_ptr. I guess it would be better if the Qt developers turned this warning into an error, because debugging marathons like this are simply not worth one's time, effort and nerves.
My questions are:
What's wrong with QSharedPointer? Why is the destructor so vital?
Why did the crashes happen when there was no destructor? These objects (which use Pimpl + QSharedPointer) are created on the stack, and no other object has access to them after their death. Yet the crashes happened some random period of time after their death.
Has anyone run into issues like this before? Please share your experience.
Are there other pitfalls like that in Qt - ones I absolutely must know about to stay safe in the future?
Hopefully these questions, and my post in general, will help others avoid the hell I've been through for the past 24 hours.
The issue has been worked around in Qt 5, see https://codereview.qt-project.org/#change,26974
The compiler calling the wrong destructor, or assuming a different memory layout, probably led to some kind of memory corruption. I'd say a compiler should give an error for this issue, not just a warning.
You'll run into a similar issue with std::unique_ptr, which can also end up running a broken destructor if used with an incomplete type. The fix is pretty trivial, of course - I declare a destructor for the class, then define it in the implementation file as
MyClass::~MyClass() = default;
The reason this is an issue for std::unique_ptr but not std::shared_ptr is that the deleter is part of the type of the former, while the latter stores it in its control block, captured at construction time when the type is still complete.
I'm having problems with C++ code loaded via dlopen() by a C++ CGI server. After a while, the program crashes unexpectedly, but consistently at a memory-management call (such as free(), calloc(), etc.), and produces a core dump similar to this:
#0 0x0000000806b252dc in kill () from /lib/libc.so.6
#1 0x0000000804a1861e in raise () from /lib/libpthread.so.2
#2 0x0000000806b2416d in abort () from /lib/libc.so.6
#3 0x0000000806abdb45 in _UTF8_init () from /lib/libc.so.6
#4 0x0000000806abdfcc in _UTF8_init () from /lib/libc.so.6
#5 0x0000000806abeb1d in _UTF8_init () from /lib/libc.so.6
... the rest of the stack
Has anyone seen something like this before?
What is _UTF8_init() and why would memory management functions call it?
That smells like a corrupted heap, likely due to a buffer overrun somewhere in your code. Try running your program with Valgrind and look for any errors or warnings it emits.
I have a program that brings up and tears down multiple threads throughout its life. Everything works great for a while, but eventually, I get the following core dump stack trace.
#0 0x009887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x007617a5 in raise () from /lib/tls/libc.so.6
#2 0x00763209 in abort () from /lib/tls/libc.so.6
#3 0x003ec1bb in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x003e9ed1 in __cxa_call_unexpected () from /usr/lib/libstdc++.so.6
#5 0x003e9f06 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x003ea04f in __cxa_throw () from /usr/lib/libstdc++.so.6
#7 0x00d5562b in boost::thread::start_thread () from /h/Program/bin/../lib/libboost_thread-gcc34-mt-1_39.so.1.39.0
At first, I was leaking threads, and figured the core was due to hitting some maximum limit on the number of concurrent threads, but now it seems that this problem occurs even when I don't. For reference, in the core above there were 13 active threads executing.
I did some searching to try and figure out why start_thread would core, but I didn't come across anything. Anyone have any ideas?
start_thread is throwing an uncaught exception. See which exceptions start_thread can throw, and place a catch around it to see what the problem is.
What are the values carried by thread_resource_error? It looks like you can call native_error() to find out.
Since this is a wrapper around pthreads, there are only a couple of possibilities: EAGAIN, EINVAL and EPERM. It looks as if Boost has exceptions it would likely throw for EINVAL and EPERM, i.e. unsupported_thread_option() and thread_permission_error().
That pretty much leaves EAGAIN, so I would double-check that you really aren't exceeding the system limit on the number of threads. Are you sure you are joining them, or, if they are detached, that they are really gone?