Deadlock when throwing an exception in C++ - c++

I'm investigating a report of a deadlock that occurred within my library, which is generally multi-threaded and written in C++11. The stacktrace during the deadlock looks like this:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
Id Target Id Frame
* 1 Thread 0x7fb40533b740 (LWP 26259) "i-foca" 0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
Thread 1 (Thread 0x7fb40533b740 (LWP 26259)):
#0 0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fb4049dde76 in _L_lock_941 () from /lib64/libpthread.so.0
#2 0x00007fb4049ddd6f in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fb40403a0af in dl_iterate_phdr () from /lib64/libc.so.6
#4 0x00007fb3eb7f3bbf in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#5 0x00007fb3eb7f0d2c in ?? () from /lib64/libgcc_s.so.1
#6 0x00007fb3eb7f16ed in ?? () from /lib64/libgcc_s.so.1
#7 0x00007fb3eb7f1b7e in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#8 0x00007fb3eba56986 in __cxa_throw () from /lib64/libstdc++.so.6
#9 0x00007fb3e7b3dd39 in <my library>
The code that causes the deadlock is basically throw NameError(...);, which is to say, a standard C++ construct which is supposed to be thread-safe. However, the code deadlocks nevertheless, trying to acquire a mutex in GLIBC's dl_iterate_phdr(). The following additional information is known about the environment:
Even though my library can spawn multiple threads, during the incident it ran in single-threaded mode, as evidenced by the stacktrace;
The program where my library is used does extensive forking-without-exec;
My library uses an at-fork handler in order to sanitize all its mutexes/threads when a fork occurs (however, I have no control over the mutexes in standard libraries). In particular, a fork cannot occur while an exception is being thrown.
I still don't understand how this deadlock could have occurred.
I'm considering the following scenarios, but not sure which one is possible and which one is not:
There are multiple child processes. One of them tries to throw an exception and crashes. If somehow the mutex that GLIBC uses is shared between child processes, one of the children could lock it and then fail to unlock it because of the crash. Is it possible for a mutex to be shared in such a way?
Another library that I'm not aware of also uses multiple threads, and the fork happens when that library throws an exception in its code, which leaves the exception mutex in locked state in the child process. My library is then merely unfortunate enough to walk into this trap.
Any other scenario?
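The second scenario above can be reproduced in a few lines. This added example (not from the question or the library it describes) forks while a worker thread holds a mutex: the child inherits a copy of that mutex in the locked state with no owner left to release it, which is exactly the trap a library-level fork handler cannot fix for locks inside libstdc++/glibc. The second mutex shows the mitigation the question already uses for its own locks, a pthread_atfork() handler that takes the lock before fork and releases it in both parent and child. Names and constructed timing are assumptions.

```cpp
#include <atomic>
#include <cerrno>
#include <mutex>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

// Two mutexes the library "owns": one covered by a pthread_atfork() handler, one not.
static pthread_mutex_t unguarded = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t guarded   = PTHREAD_MUTEX_INITIALIZER;
static std::atomic<bool> ready{false}, stop{false};

static void lock_guarded()   { pthread_mutex_lock(&guarded); }
static void unlock_guarded() { pthread_mutex_unlock(&guarded); }

// Worker: holds 'unguarded' for its whole lifetime (like an unwinder or allocator
// lock held at the moment of fork) and takes 'guarded' only for short sections.
static void* worker(void*) {
    pthread_mutex_lock(&unguarded);
    ready = true;
    while (!stop) {
        pthread_mutex_lock(&guarded);
        pthread_mutex_unlock(&guarded);
        usleep(500);
    }
    pthread_mutex_unlock(&unguarded);
    return nullptr;
}

// Forks while the worker runs and reports which mutexes the child finds locked:
// bit 0 = unguarded, bit 1 = guarded. Returns -1 if fork or the child failed.
int locks_inherited_by_child() {
    static std::once_flag once;   // handlers can be registered only once per process
    // prepare: take 'guarded' so fork() cannot happen mid-critical-section;
    // parent and child handlers each release their own copy afterwards.
    std::call_once(once, [] { pthread_atfork(lock_guarded, unlock_guarded, unlock_guarded); });
    ready = false; stop = false;
    pthread_t t;
    pthread_create(&t, nullptr, worker, nullptr);
    while (!ready) usleep(500);

    pid_t pid = fork();
    if (pid == 0) {                      // child: the worker thread does not exist here
        int mask = 0;
        if (pthread_mutex_trylock(&unguarded) == EBUSY) mask |= 1;  // locked forever
        if (pthread_mutex_trylock(&guarded) == EBUSY)   mask |= 2;  // freed by atfork
        _exit(mask);
    }
    int status = 0;
    if (pid > 0) waitpid(pid, &status, 0);
    stop = true;
    pthread_join(t, nullptr);
    return (pid > 0 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

The child reports 1: the unguarded lock is inherited held and would block any child code that needs it, while the atfork-covered lock is free.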

Related

What is __lll_lock_wait_private and what can cause a hang while malloc_consolidate is called?

I have used 2 threads, but they are getting stuck with the following stack trace:
Thread 2:
(gdb) bt
#0 0x00007f9e1d7625bc in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f9e1d6deb35 in _L_lock_17166 () from /lib64/libc.so.6
#2 0x00007f9e1d6dbb73 in malloc () from /lib64/libc.so.6
#3 0x00007f9e1d6c4bad in __fopen_internal () from /lib64/libc.so.6
#4 0x00007f9e1dda2210 in std::__basic_file<char>::open(char const*, std::_Ios_Openmode, int) () from /lib64/libstdc++.so.6
#5 0x00007f9e1dddd5ba in std::basic_filebuf<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode) () from /lib64/libstdc++.so.6
#6 0x00000000005e1244 in fatalSignalHandler(int, siginfo*, void*) ()
#7 <signal handler called>
#8 0x00007f9e1d6d6839 in malloc_consolidate () from /lib64/libc.so.6
#9 0x00007f9e1d6d759e in _int_free () from /lib64/libc.so.6
_int_free is getting called as a result of default destructor.
Thread 1:
(gdb) bt
#0 0x00007f9e2a4ed54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f9e2a4e8e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007f9e2a4e8d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
From Threads getting stuck with few threads at point "in __lll_lock_wait" I learned that __lll_lock_wait() is called if we are not able to get a lock on the mutex, since something else (in this case I guess Thread 2) is still holding it.
But Thread 2 is also stuck with the given stack trace, and since there are no debug symbols, I can't check who the owner of the mutex is. So my questions are:
What is the use of / cause of __lll_lock_wait_private()?
Is there any hint as to what and where the issue could be, without debug symbols being available?
Several times I have seen a hang in malloc_consolidate() on Linux. Is this a well-known and yet-to-be-solved issue?
Frames 6 and 7 of thread 2 suggest a custom signal handler was installed. Frame 5 suggests it is trying to do something like write to a file (std::ofstream?).
That is not allowed. Very little is allowed in signal handlers, and definitely not iostreams.
Suppose you are in a function like malloc_consolidate which may have to touch the global arena, and take a lock to do it, and a signal comes along. If you allocate memory in the signal handler, you also need the same lock, which is already being held. Thread 2 is deadlocking itself.
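As an added illustration of the answer above (not code from the asker or answerer), this is the shape a fatal-signal handler needs to have to avoid the self-deadlock shown in thread 2: it formats its message with no allocation, writes with write(2), which is async-signal-safe, and re-raises the signal so the default action (core dump) still happens. The handler and helper names are assumptions; a real program would open its log file at startup and keep the descriptor rather than opening a stream inside the handler.

```cpp
#include <csignal>
#include <cstddef>
#include <cstring>
#include <unistd.h>

// Descriptor for crash output, set up before any signal can arrive.
static int crash_fd = STDERR_FILENO;

// Decimal-format 'v' into buf without malloc or locale machinery. Returns chars written.
size_t append_uint(char* buf, size_t cap, unsigned long v) {
    char tmp[20];
    size_t n = 0;
    do { tmp[n++] = char('0' + v % 10); v /= 10; } while (v && n < sizeof tmp);
    if (n > cap) n = cap;
    for (size_t i = 0; i < n; ++i) buf[i] = tmp[n - 1 - i];   // reverse digits
    return n;
}

// Builds "fatal signal <signo>\n" in a caller-supplied buffer; returns its length.
// Nothing here can take the allocator lock the interrupted code may hold.
size_t format_fatal(char* buf, size_t cap, int signo) {
    static const char prefix[] = "fatal signal ";
    size_t n = sizeof prefix - 1;
    if (n + 2 > cap) return 0;
    memcpy(buf, prefix, n);
    n += append_uint(buf + n, cap - n - 1, (unsigned long)signo);
    buf[n++] = '\n';
    return n;
}

static void fatal_signal_handler(int signo, siginfo_t*, void*) {
    char buf[64];
    size_t n = format_fatal(buf, sizeof buf, signo);
    ssize_t r = write(crash_fd, buf, n);   // async-signal-safe, unlike iostreams
    (void)r;                                // short writes are acceptable on a crash path
    // SA_RESETHAND already restored the default action; the re-raised signal is
    // delivered when the handler returns, so the process still aborts and dumps core.
    raise(signo);
}

// Installs the handler for SIGSEGV and SIGABRT; returns 0 on success, -1 on failure.
int install_fatal_handler() {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fatal_signal_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, nullptr) != 0) return -1;
    return sigaction(SIGABRT, &sa, nullptr);
}
```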

c++11 threads - Operation not permitted exception

I'm writing a C++11 multi-threaded application.
The main thread reads from a database and puts records in a std::queue; threads take records from the queue and process them.
The application is synchronised using std::mutex and std::condition_variable (defined as class members); methods use std::unique_lock on the class member mutex.
After some time (usually a few minutes) my application crashes with
terminate called after throwing an instance of 'std::system_error'
what(): Operation not permitted
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff5464700 (LWP 10242)]
0x00007ffff60b3515 in raise () from /lib64/libc.so.6
The backtrace from gdb shows:
#0 0x00007ffff60b3515 in raise () from /lib64/libc.so.6
#1 0x00007ffff60b498b in abort () from /lib64/libc.so.6
#2 0x00007ffff699f765 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#3 0x00007ffff699d906 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#4 0x00007ffff699d933 in std::terminate() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#5 0x00007ffff69f0a75 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#6 0x00007ffff76741a7 in start_thread () from /lib64/libpthread.so.0
#7 0x00007ffff616a1fd in clone () from /lib64/libc.so.6
How can I get more information about this exception?
I am compiling and linking with the -pthread option.
G++ 4.8.3 on a Gentoo Linux machine.
The -g option is enabled in both compiler and linker.
I tried disabling optimization.

How to tell what line of code created new thread (gdb)?

I'm attempting to debug a rather complicated program that is segfaulting. I've just learned about gdb and am trying to use it to find the problem. Currently, it shows
[New Thread 0x7fff4963700 (LWP 4768)]
[New Thread 0x7fff1faf700 (LWP 4769)]
[New Thread 0x7fff17ae700 (LWP 4768)]
very shortly after my program commences. That would be great if I had written multithreaded code, but I haven't. Is there a way to tell exactly what line of code is creating these new threads?
On Linux, catch syscall clone should break on the creation of all threads (and possibly some processes). Note that it will break in the creator thread (the new thread is yet to be started).
Since you get the full backtrace that leads to the clone, if you need to extract the new thread entry point you should do up until you reach the pthread_create (or similar library function) stack frame and take it from its parameters (you can also directly check the parameters to clone, but I fear that the address there will be of some pthread library stub).
Threads have their own call stack. The only thing you can see is the value on the bottom of the stack. Select the thread with t <thread id> or thread <thread id> and get the call stack using bt or backtrace. You can obtain thread ids by pausing execution of your application in gdb and running info threads.
For example, my gdb session looks like this (I tried to make it clearer for you):
(gdb) t 23
[Switching to thread 23 (Thread 0x7fff8ffff700 (LWP 32334))]
#0 0x00007fffc0cb829e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
(gdb) bt
#0 0x00007fffc0cb829e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1 0x00007fffc0cb5bb0 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2 0x00007ffff52b10a5 in start_thread (arg=0x7fff8ffff700) at pthread_create.c:309
#3 0x00007ffff591a88d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Here gdb says that the first value of the call stack is somewhere in libgomp.so (the OpenMP library). Next you can see pthread_create.c, which is the system-dependent method of starting a thread.

Simultaneous abort() in two threads

I have a backtrace with something I haven't seen before. See frame 2 in these threads:
Thread 31 (process 8752):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0b139 in sigprocmask () from /lib/libc.so.6
#2 0x00b0c7a2 in abort () from /lib/libc.so.6
#3 0x00752aa0 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x00750505 in ?? () from /usr/lib/libstdc++.so.6
#5 0x00750542 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x00750c65 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7 0x00299c63 in ApplicationFunction()
Thread 1 (process 8749):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0ad80 in raise () from /lib/libc.so.6
#2 0x00b0c691 in abort () from /lib/libc.so.6
#3 0x00b4324b in __libc_message () from /lib/libc.so.6
#4 0x00b495b6 in malloc_consolidate () from /lib/libc.so.6
#5 0x00b4b3bd in _int_malloc () from /lib/libc.so.6
#6 0x00b4d3ab in malloc () from /lib/libc.so.6
#7 0x08147f03 in AnotherApplicationFunction ()
When opening it with gdb and getting a backtrace, it gives me thread 1. Later I saw the weird state that thread 31 is in. This thread is from the library that we had problems with, so I'd believe the crash is caused by it.
So what does it mean? Two threads simultaneously doing something illegal? Or is it one of them, somehow causing abort() in the other one?
The OS is Linux Red Hat Enterprise 5.3, it's a multiprocessor server.
It is hard to be sure, but my first suspicion upon seeing these stack traces would be a memory corruption (possibly a buffer overrun on the heap). If that's the case, then the corruption is probably the root cause of both threads ending up in abort.
Can you valgrind your app?
Looks like it could be heap corruption, detected by malloc in thread 1, causing or caused by the error in thread 31.
Some broken piece of code overwriting, among other things, the vtable in thread 31 could easily cause this.
It's possible that the reason thread 31 aborted is because it trashed the application heap in some way. Then when the main thread tried to allocate memory the heap data structure was in a bad state, causing the allocation to fail and abort the application again.

Boost threads coring on startup

I have a program that brings up and tears down multiple threads throughout its life. Everything works great for a while, but eventually I get the following core dump stack trace.
#0 0x009887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x007617a5 in raise () from /lib/tls/libc.so.6
#2 0x00763209 in abort () from /lib/tls/libc.so.6
#3 0x003ec1bb in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x003e9ed1 in __cxa_call_unexpected () from /usr/lib/libstdc++.so.6
#5 0x003e9f06 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x003ea04f in __cxa_throw () from /usr/lib/libstdc++.so.6
#7 0x00d5562b in boost::thread::start_thread () from /h/Program/bin/../lib/libboost_thread-gcc34-mt-1_39.so.1.39.0
At first, I was leaking threads, and figured the core was due to hitting some maximum limit on the number of current threads, but now it seems that this problem occurs even when I don't. For reference, in the core above there were 13 active threads executing.
I did some searching to try and figure out why start_thread would core, but I didn't come across anything. Anyone have any ideas?
start_thread is throwing an uncaught exception; see which exceptions start_thread can throw and place a catch around it to see what the problem is.
What are the values carried by thread_resource_error? It looks like you can call native_error() to find out.
Since this is a wrapper around pthreads there are only a couple of possibilities - EAGAIN, EINVAL and EPERM. It looks as if boost has exceptions it would likely throw for EINVAL and EPERM - i.e. unsupported_thread_option() and thread_permission_error().
That pretty much leaves EAGAIN so I would double check that you really aren't exceeding the system limits on the number of threads. You are sure you are joining them, or if detached, they are really gone?