I'm debugging C++ deadlocks with gdb. Inevitably all the deadlock frames are deep inside library functions. For example, __lll_lock_wait is found several layers under mutex::lock(). It would be really great if I could just glance at the thread list to deduce where the deadlock is happening instead of going through the backtrace for each thread.
Example
main.cpp
In this example I purposely introduce a deadlock by locking the same (non-recursive) mutex twice.
#include <mutex>
#include <thread>

void lock_a(){
    std::mutex a;
    a.lock();
    // attempt to lock something that's already locked (undefined behavior for a non-recursive std::mutex; in practice it blocks forever)
    a.lock();
}

int main(){
    std::thread t_a(lock_a);
    t_a.join();
    return 0;
}
compilation & running
clang++ -g main.cpp -lpthread -o main.exe
gdb -ex run --args ./main.exe
gdb
Here I'm trying to figure out what caused the deadlock. If info threads could somehow prioritize "local" files, I might find the cause much faster. At long last I'm able to see that I'm trying to lock a mutex in main.cpp:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7a41740 (LWP 2651239) "main.exe" __pthread_clockjoin_ex (threadid=140737348110080, thread_return=0x0,
clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
2 Thread 0x7ffff7a40700 (LWP 2651293) "main.exe" __lll_lock_wait (futex=futex@entry=0x7ffff7a3fe08, private=0) at lowlevellock.c:52
(gdb) t 2
[Switching to thread 2 (Thread 0x7ffff7a40700 (LWP 2651293))]
#0 __lll_lock_wait (futex=futex@entry=0x7ffff7a3fe08, private=0) at lowlevellock.c:52
52 lowlevellock.c: No such file or directory.
(gdb) f 4
#4 0x0000000000401223 in lock_a () at main.cpp:8
8 a.lock();
Is there some way to have gdb prioritize certain files when listing threads?
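In the meantime, thread apply all bt at least dumps every thread's backtrace in one command, so I can search the output for main.cpp instead of switching threads one by one (output abridged; the frames are the same ones shown above):
(gdb) thread apply all bt

Thread 2 (Thread 0x7ffff7a40700 (LWP 2651293)):
#0  __lll_lock_wait (futex=futex@entry=0x7ffff7a3fe08, private=0) at lowlevellock.c:52
...
#4  0x0000000000401223 in lock_a () at main.cpp:8
...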
Related
I'm investigating a report of a deadlock that occurred within my library, which is generally multi-threaded and written in C++11. The stacktrace during the deadlock looks like this:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
Id Target Id Frame
* 1 Thread 0x7fb40533b740 (LWP 26259) "i-foca" 0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
Thread 1 (Thread 0x7fb40533b740 (LWP 26259)):
#0 0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fb4049dde76 in _L_lock_941 () from /lib64/libpthread.so.0
#2 0x00007fb4049ddd6f in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fb40403a0af in dl_iterate_phdr () from /lib64/libc.so.6
#4 0x00007fb3eb7f3bbf in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#5 0x00007fb3eb7f0d2c in ?? () from /lib64/libgcc_s.so.1
#6 0x00007fb3eb7f16ed in ?? () from /lib64/libgcc_s.so.1
#7 0x00007fb3eb7f1b7e in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#8 0x00007fb3eba56986 in __cxa_throw () from /lib64/libstdc++.so.6
#9 0x00007fb3e7b3dd39 in <my library>
The code that causes the deadlock is basically throw NameError(...);, which is to say, a standard C++ construct which is supposed to be thread-safe. However, the code deadlocks nevertheless, trying to acquire a mutex in GLIBC's dl_iterate_phdr(). The following additional information is known about the environment:
Even though my library can spawn multiple threads, during the incident it ran in single-threaded mode, as evidenced by the stacktrace;
The program where my library is used does extensive forking-without-exec;
My library uses an at-fork handler in order to sanitize all its mutexes/threads when a fork occurs (however, I have no control over the mutexes in standard libraries). In particular, a fork cannot occur while an exception is being thrown.
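For reference, the at-fork handler is along these lines (a simplified sketch with invented names, not my actual code):

#include <pthread.h>
#include <mutex>

// Simplified sketch: take the library's locks in the pre-fork handler so
// the child can never inherit them in a locked state, then release them
// on both sides of the fork.
static std::mutex g_lib_mutex;   // stands in for all of the library's mutexes

static void before_fork()       { g_lib_mutex.lock(); }
static void after_fork_parent() { g_lib_mutex.unlock(); }
static void after_fork_child()  { g_lib_mutex.unlock(); }

// Registered once during library initialization.
static const int atfork_registered =
    pthread_atfork(before_fork, after_fork_parent, after_fork_child);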
I still don't understand how this deadlock could have occurred.
I'm considering the following scenarios, but I am not sure which ones are possible and which are not:
There are multiple child processes. One of them tries to throw an exception and crashes. If the mutex that GLIBC uses were somehow shared between the child processes, one of the children could lock it and then fail to unlock it because of the crash. Is it possible for a mutex to be shared in such a way?
Another library that I'm not aware of also uses multiple threads, and the fork happens while that library is throwing an exception in its code, which leaves the exception mutex locked in the child process. My library is then merely unfortunate enough to walk into this trap.
Any other scenario?
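If it helps, here is a minimal sketch of what I imagine scenario 2 would look like (hypothetical code; whether it actually reproduces depends on the glibc/libgcc versions):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <thread>

// One thread throws exceptions in a tight loop while the main thread
// forks. With unlucky timing, fork() runs while the thrower holds glibc's
// loader lock inside _Unwind_Find_FDE -> dl_iterate_phdr; the child then
// inherits that mutex in a locked state, and its own first throw blocks
// forever in __lll_lock_wait.
int main() {
    std::thread thrower([] {
        for (;;) {
            try { throw 42; } catch (int) {}
        }
    });
    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {            // child: only the forking thread survives
            try { throw 42; }      // may deadlock in dl_iterate_phdr
            catch (int) {}
            _exit(0);
        }
        waitpid(pid, nullptr, 0);
    }
}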
I am supporting an application that has been written in C++ over many years, and lately it has started to crash, producing core dumps that we don't know how to interpret.
It runs on an appliance on Ubuntu 14.04.5.
When loading the core file in GDB it says that:
Program terminated with signal SIGABRT, Aborted
I can inspect 230 threads, but they are all in wait() at the exact same memory position.
There is a thread with ID 1 that in theory could be the responsible one, but that thread is also in wait().
So I have two questions basically.
How does the id index of the threads work?
Is the thread with GDB ID 1 the last active thread, or is that an arbitrary index, meaning the failure could be in any of the other threads?
How can all threads be in wait() when a SIGABRT is triggered?
Shouldn't the instruction pointer be at the failing instruction when the OS decided to step in and halt the process? Or is it some sort of deadlock protection?
Any help much appreciated.
Backtrace of thread 1:
#0 0xf771dcd9 in ?? ()
#1 0xf74ad4ca in _int_free (av=0x38663364, p=<optimized out>,have_lock=-186161432) at malloc.c:3989
#2 0xf76b41ab in std::string::_Rep::_M_destroy(std::allocator<char> const&) () from /usr/lib32/libstdc++.so.6
#3 0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#4 0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#5 0x5685e8b4 in SlimStringMapper::~SlimStringMapper() ()
#6 0x567d6bc3 in destroy ()
#7 0x566a40b4 in HttpProxy::getLogonCredentials(HttpClient*, HttpServerTransaction*, std::string const&, std::string const&, std::string&, std::string&) ()
#8 0x566a5d04 in HttpProxy::add_authorization_header(HttpClient*, HttpServerTransaction*, Hosts::Host*) ()
#9 0x566af97c in HttpProxy::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#10 0x566d597e in callOnClientRequest(HttpClient*, HttpServerTransaction*, FastHttpRequest*) ()
#11 0x566d169f in GateKeeper::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#12 0x566a2291 in HttpClientThread::run() ()
#13 0x5682e37c in wa_run_thread ()
#14 0xf76f6f72 in start_thread (arg=0xec65ab40) at pthread_create.c:312
#15 0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#16 0xec65ab40 in ?? ()
Another thread that should be in wait:
#0 0xf771dcd9 in ?? ()
#1 0x5682e37c in wa_run_thread ()
#2 0xf76f6f72 in start_thread (arg=0xf33bdb40) at pthread_create.c:312
#3 0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#4 0xf33bdb40 in ?? ()
Best regards
Jon
How can all threads be in wait() when a SIGABRT is triggered?
Is wait the POSIX function, or something from the run-time environment? Are you looking at a higher-level backtrace?
Anyway, there is an easy explanation for why this can happen: SIGABRT was sent to the process, rather than generated synchronously by a thread. Perhaps a coworker, after observing the deadlock, sent the signal to create the core dump and collect evidence for later analysis?
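If the kernel recorded the signal metadata in the core (modern Linux kernels save an NT_SIGINFO note), gdb can often tell you who sent the signal. For example (the PID shown here is of course made up):
(gdb) print $_siginfo.si_signo
$1 = 6
(gdb) print $_siginfo._sifields._kill.si_pid
$2 = 12345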
How does the id index of the threads work? Is thread with GDB ID 1 the last active thread?
When the program is running under GDB, GDB numbers threads as it discovers them, so thread 1 is always the main thread.
But when loading a core dump, GDB discovers threads in the order in which the kernel saved them. The kernels that I have seen always save the thread which caused program termination first, so usually loading a core into GDB immediately gets you to the crash point without the need to switch threads.
How can all threads be in wait() when a SIGABRT is triggered?
One possibility is that you are not analyzing the core correctly. In particular, you need exact copies of the shared libraries that were in use at the time the core was produced, and that's unlikely to be the case when the application runs on an appliance and you are analyzing the core on your development machine. See this answer.
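Concretely, you would point gdb at a copy of the appliance's root filesystem before loading the core (the paths here are placeholders):
$ gdb /path/to/appliance-rootfs/opt/app/the-binary
(gdb) set sysroot /path/to/appliance-rootfs
(gdb) set solib-search-path /path/to/appliance-rootfs/lib:/path/to/appliance-rootfs/usr/lib
(gdb) core-file ./core
(gdb) thread apply all bt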
I just saw your question. My answer is not specific to your exact question, but offers some general ways to handle this kind of situation. Multi-threading depends heavily on the hardware and operating system of the machine, especially on memory and processors: more threads require more memory as well as more processor time slices. Your machine almost certainly does not have over 100 processors, so 230 threads cannot all run concurrently at full performance. To avoid this situation, consider the steps below.
Control the creation of threads; limit the number of threads running concurrently.
Increase the memory available to your application (check compiler options for increasing memory at run time, or O/S settings for allocating enough memory).
Set the guard size and stack size of each thread properly; see the sketch at the end of this answer. (The calculation needs to be based on what your application's threads actually do, which is a bit complicated; please read the relevant documentation.)
Structure synchronized blocks carefully to avoid any deadlock.
Where necessary, use condition variables and similar primitives.
Since you say that most of your threads are in a wait state, they are waiting for a lock to be released; that means one thread has already acquired the lock and is either still busy processing or stuck in a deadlock.
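As an illustration of the stack-size point above, here is a minimal POSIX-threads sketch (not code from your application):

#include <pthread.h>

static void* worker(void*) {
    // thread body goes here
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // 256 KiB per thread instead of the platform default (often 8 MiB):
    // with 230 threads that is ~60 MB of stack address space instead of
    // ~1.8 GB, which matters especially in a 32-bit process.
    pthread_attr_setstacksize(&attr, 256 * 1024);

    pthread_t tid;
    pthread_create(&tid, &attr, worker, nullptr);
    pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
    return 0;
}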
I have a C++ application that starts as a single thread and processes video frames. For each frame, the application spawns two threads that are then joined; this happens in a loop, once per frame.
I'm trying to investigate whether there is another thread that I haven't detected. The application is quite complex and loads shared libraries that may spawn threads of their own.
I use gdb's info threads for that.
This is what I get:
Id Target Id Frame
7 Thread 0x7fffde7fc700 (LWP 16644) "my_debugged_process" sem_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
6 Thread 0x7fffdeffd700 (LWP 16643) "my_debugged_process" sem_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
5 Thread 0x7fffdf7fe700 (LWP 16642) "my_debugged_process" sem_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
4 Thread 0x7fffdffff700 (LWP 16641) "my_debugged_process" sem_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
3 Thread 0x7fffe4988700 (LWP 16640) "my_debugged_process" sem_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
2 Thread 0x7fffe5c0b700 (LWP 16639) "my_debugged_process" 0x00007ffff3dc812d in poll ()
at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7ffff7fc2800 (LWP 16636) "my_debugged_process" TheApplication::SomeClass::processFrame (this=0x743530, srcI=...,
dstI=...) at TheApplication.cpp:315
So the question is:
What are threads 2 to 7? Are they somehow related to my process? I only recognize thread 1.
I see that they are all waiting on a semaphore, so I'm inclined to say that they belong to the debugger.
First, what Jonathan said in the comments: on Linux, gdb does not create any threads in your process. gdb tries to have a reasonably minimal impact on your application -- it can't be zero but it is pretty close.
Second, what Jonathan said again: to try to understand the threads after they are running, get a backtrace and see if it makes any sense. For a single thread:
(gdb) thread 52 # e.g.
(gdb) bt
Or for all of them:
(gdb) thread apply all bt
Finally, to see threads when they are created, one way to try is to get a backtrace when the thread is started:
(gdb) break pthread_create
(gdb) commands
> bt
> cont
> end
This should print a stack trace when a thread is created. This won't necessarily work in all cases (some programs call clone directly) but it should be ok for well-behaved libraries.
I'm writing a C++11 multi-threaded application.
The main thread reads from a database and puts records into a std::queue; worker threads take records from the queue and process them.
The application is synchronized using std::mutex and std::condition_variable (defined as class members), and methods use std::unique_lock on the class-member mutex.
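A minimal sketch of the setup (the real code is larger; all names here are invented):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

class RecordProcessor {
    std::queue<std::string> queue_;
    std::mutex mutex_;                  // class-member mutex
    std::condition_variable cv_;        // class-member condition variable
public:
    // called by the main (database-reading) thread
    void push(std::string record) {
        std::unique_lock<std::mutex> lock(mutex_);
        queue_.push(std::move(record));
        cv_.notify_one();
    }
    // each worker thread runs this loop
    void worker() {
        for (;;) {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !queue_.empty(); });
            std::string record = std::move(queue_.front());
            queue_.pop();
            lock.unlock();
            // ... process record ...
        }
    }
};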
After some time (usually a few minutes), my application crashes with:
terminate called after throwing an instance of 'std::system_error'
what(): Operation not permitted
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff5464700 (LWP 10242)]
0x00007ffff60b3515 in raise () from /lib64/libc.so.6
backtrace from gdb shows:
#0 0x00007ffff60b3515 in raise () from /lib64/libc.so.6
#1 0x00007ffff60b498b in abort () from /lib64/libc.so.6
#2 0x00007ffff699f765 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#3 0x00007ffff699d906 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#4 0x00007ffff699d933 in std::terminate() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#5 0x00007ffff69f0a75 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#6 0x00007ffff76741a7 in start_thread () from /lib64/libpthread.so.0
#7 0x00007ffff616a1fd in clone () from /lib64/libc.so.6
How can I get more information about this exception?
I am compiling and linking with the -pthread option.
g++ 4.8.3 on a Gentoo Linux machine.
The -g option is enabled in both the compiler and the linker.
I also tried disabling optimization.
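One thing I plan to try is gdb's catch throw, which stops at the throw point before the stack is unwound, so bt should show exactly which of my methods throws:
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) run
...
Catchpoint 1 (exception thrown), ... in __cxa_throw () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
(gdb) bt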
I'm attempting to debug a rather complicated program that is segfaulting. I've just learned about gdb and am trying to use it to find the problem. Currently, it shows
[New Thread 0x7fff4963700 (LWP 4768)]
[New Thread 0x7fff1faf700 (LWP 4769)]
[New Thread 0x7fff17ae700 (LWP 4768)]
very shortly after my program commences. That would be great if I had written multithreaded code, but I haven't. Is there a way to tell exactly what line of code is creating these new threads?
On Linux, catch syscall clone should break on the creation of all threads (and possibly of some processes). Notice that it breaks in the creating thread (i.e., the new thread has not started yet).
Since you get the full backtrace that leads to the clone, if you need to extract the new thread's entry point you can issue up until you reach the pthread_create (or similar library function) stack frame and take it from that frame's parameters (you could also check the parameters to clone directly, but I fear the address there will be of some pthread library stub).
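For example (a sketch: the exact frames, the parameter name start_routine, and the printed values are illustrative, and depend on your glibc version and on whether its debug info is installed):
(gdb) catch syscall clone
Catchpoint 1 (syscall 'clone' [56])
(gdb) run
...
(gdb) bt
...
(gdb) up    # repeat until the pthread_create frame is selected
(gdb) print start_routine
$1 = (void *(*)(void *)) 0x401150 <some_thread_entry>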
Each thread has its own call stack, and in the thread list you can only see the innermost frame of each stack. Select a thread with t <thread id> or thread <thread id> and get its call stack using bt or backtrace. You can obtain thread ids by pausing execution of your application in gdb and running info threads.
For example, my gdb session looks like this (I tried to make it especially clear for you):
(gdb) t 23
[Switching to thread 23 (Thread 0x7fff8ffff700 (LWP 32334))]
#0 0x00007fffc0cb829e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
(gdb) bt
#0 0x00007fffc0cb829e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1 0x00007fffc0cb5bb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2 0x00007ffff52b10a5 in start_thread (arg=0x7fff8ffff700) at pthread_create.c:309
#3 0x00007ffff591a88d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Here gdb says that the innermost frame of the call stack is somewhere in libgomp.so (the OpenMP library). Further down you can see pthread_create.c, which is the system-dependent mechanism for starting a thread.