I'm writing a C++11 multi-threaded application.
The main thread reads from a database and puts records into a std::queue; worker threads take records from the queue and process them.
The application is synchronized using a std::mutex and a std::condition_variable (defined as class members); methods use std::unique_lock on the class-member mutex.
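In outline, the structure is roughly this (a simplified sketch; the names are illustrative, not my actual code):

#include <condition_variable>
#include <mutex>
#include <queue>

struct Record { /* one database row */ };

class Dispatcher {
    std::queue<Record> queue_;
    std::mutex mutex_;                 // class-member mutex
    std::condition_variable cond_;     // class-member condition variable
public:
    void push(Record r) {              // called by the main (reader) thread
        std::unique_lock<std::mutex> lock(mutex_);
        queue_.push(std::move(r));
        cond_.notify_one();
    }
    Record pop() {                     // called by the worker threads
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !queue_.empty(); });
        Record r = std::move(queue_.front());
        queue_.pop();
        return r;
    }
};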
After some time (usually a few minutes), my application crashes with:
terminate called after throwing an instance of 'std::system_error'
what(): Operation not permitted
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff5464700 (LWP 10242)]
0x00007ffff60b3515 in raise () from /lib64/libc.so.6
The backtrace from gdb shows:
#0 0x00007ffff60b3515 in raise () from /lib64/libc.so.6
#1 0x00007ffff60b498b in abort () from /lib64/libc.so.6
#2 0x00007ffff699f765 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#3 0x00007ffff699d906 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#4 0x00007ffff699d933 in std::terminate() () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#5 0x00007ffff69f0a75 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.3/libstdc++.so.6
#6 0x00007ffff76741a7 in start_thread () from /lib64/libpthread.so.0
#7 0x00007ffff616a1fd in clone () from /lib64/libc.so.6
How can I get more information about this exception?
I am compiling and linking with the -pthread option.
G++ 4.8.3 on a Gentoo Linux machine.
The -g option is enabled for both the compiler and the linker.
I have tried disabling optimization.
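To at least capture the error code before terminate() fires, I'm considering wrapping each thread's entry point like this (a sketch; process_queue() is a placeholder for the thread's real work):

#include <iostream>
#include <system_error>

void process_queue();   // placeholder for the thread's real work

void worker_entry() {
    try {
        process_queue();
    } catch (const std::system_error& e) {
        // e.code() distinguishes e.g. EPERM from EDEADLK before std::terminate() runs
        std::cerr << "system_error " << e.code() << ": " << e.what() << std::endl;
        throw;   // rethrow so behaviour is otherwise unchanged
    }
}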
I'm investigating a report of a deadlock that occurred within my library, which is generally multi-threaded and written in C++11. The stack trace during the deadlock looks like this:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
Id Target Id Frame
* 1 Thread 0x7fb40533b740 (LWP 26259) "i-foca" 0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
Thread 1 (Thread 0x7fb40533b740 (LWP 26259)):
#0 0x00007fb4049e250d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fb4049dde76 in _L_lock_941 () from /lib64/libpthread.so.0
#2 0x00007fb4049ddd6f in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fb40403a0af in dl_iterate_phdr () from /lib64/libc.so.6
#4 0x00007fb3eb7f3bbf in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#5 0x00007fb3eb7f0d2c in ?? () from /lib64/libgcc_s.so.1
#6 0x00007fb3eb7f16ed in ?? () from /lib64/libgcc_s.so.1
#7 0x00007fb3eb7f1b7e in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#8 0x00007fb3eba56986 in __cxa_throw () from /lib64/libstdc++.so.6
#9 0x00007fb3e7b3dd39 in <my library>
The code that causes the deadlock is basically throw NameError(...); - that is, a standard C++ construct that is supposed to be thread-safe. Nevertheless, the code deadlocks trying to acquire a mutex in glibc's dl_iterate_phdr(). The following additional information is known about the environment:
Even though my library can spawn multiple threads, during the incident it ran in single-threaded mode, as evidenced by the stack trace;
The program where my library is used does extensive forking-without-exec;
My library uses an at-fork handler to sanitize all its mutexes/threads when a fork occurs, roughly as sketched after this list (however, I have no control over the mutexes in the standard libraries). In particular, a fork cannot occur while an exception is being thrown.
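For context, the handler follows the usual pthread_atfork() pattern, roughly like this (simplified; my_mutex stands in for the library's internal locks):

#include <pthread.h>

static pthread_mutex_t my_mutex = PTHREAD_MUTEX_INITIALIZER;

static void prepare() { pthread_mutex_lock(&my_mutex); }    // quiesce before fork()
static void parent()  { pthread_mutex_unlock(&my_mutex); }  // resume in the parent
static void child()   { pthread_mutex_unlock(&my_mutex); }  // sanitize state inherited by the child

static void install_fork_handlers() {
    pthread_atfork(prepare, parent, child);
}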
I still don't understand how this deadlock could have occurred.
I'm considering the following scenarios, but I'm not sure which ones are possible and which are not:
There are multiple child processes. One of them throws an exception and crashes. If the mutex that glibc uses were somehow shared between child processes, one child could lock it and then fail to unlock it because of the crash. Is it possible for a mutex to be shared in such a way?
Another library that I'm not aware of also uses multiple threads, and the fork happens while that library is throwing an exception in its code, which leaves the exception mutex in a locked state in the child process. My library is then merely unfortunate enough to walk into this trap.
Any other scenario?
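If it helps, scenario 2 should in principle be reproducible with a stress test along these lines (hypothetical sketch; I have not confirmed that it actually triggers the hang):

#include <atomic>
#include <thread>
#include <sys/wait.h>
#include <unistd.h>

std::atomic<bool> stop{false};

void thrower() {
    while (!stop)
        try { throw 42; } catch (int) {}    // keeps _Unwind_RaiseException busy
}

int main() {
    std::thread t(thrower);
    for (int i = 0; i < 10000; ++i) {
        pid_t pid = fork();                 // may land while thrower holds glibc's dl lock
        if (pid == 0) {
            try { throw 1; } catch (int) {} // child: may block in dl_iterate_phdr here
            _exit(0);
        }
        waitpid(pid, nullptr, 0);
    }
    stop = true;
    t.join();
}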
I have used 2 threads, but they are getting stuck with the following stack traces:
Thread 2:
(gdb) bt
#0 0x00007f9e1d7625bc in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f9e1d6deb35 in _L_lock_17166 () from /lib64/libc.so.6
#2 0x00007f9e1d6dbb73 in malloc () from /lib64/libc.so.6
#3 0x00007f9e1d6c4bad in __fopen_internal () from /lib64/libc.so.6
#4 0x00007f9e1dda2210 in std::__basic_file<char>::open(char const*, std::_Ios_Openmode, int) () from /lib64/libstdc++.so.6
#5 0x00007f9e1dddd5ba in std::basic_filebuf<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode) () from /lib64/libstdc++.so.6
#6 0x00000000005e1244 in fatalSignalHandler(int, siginfo*, void*) ()
#7 <signal handler called>
#8 0x00007f9e1d6d6839 in malloc_consolidate () from /lib64/libc.so.6
#9 0x00007f9e1d6d759e in _int_free () from /lib64/libc.so.6
_int_free is being called as a result of a default destructor.
Thread 1:
(gdb) bt
#0 0x00007f9e2a4ed54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f9e2a4e8e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007f9e2a4e8d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
From Threads getting stuck with few threads at point "in __lll_lock_wait" I learned that __lll_lock_wait() is called when a thread cannot acquire a lock on a mutex, because something else (in this case, I guess, Thread 2) is still holding it.
But Thread 2 is also stuck with the stack trace shown above, and since the binaries have no debug symbols, I can't check who owns the mutex. So my questions are:
What is the purpose of __lll_lock_wait_private(), and what causes it to be called?
Is there any hint as to what and where the issue could be, without debug symbols available?
Several times I have seen hangs in malloc_consolidate() on Linux. Is this a well-known, still-unsolved issue?
Frames 6 and 7 of thread 2 suggest a custom signal handler was installed. Frame 5 suggests it is trying to do something like write to a file (std::ofstream?).
That is not allowed. Very little is allowed in signal handlers, and definitely not iostreams.
Suppose you are in a function like malloc_consolidate, which may have to touch the global arena and take a lock to do so, and a signal comes along. If you allocate memory in the signal handler, you need the same lock, which is already held. Thread 2 is deadlocking itself.
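If the handler absolutely must log, the async-signal-safe route is write() on a descriptor opened in advance, never iostreams or fopen(). A minimal sketch (fatal_fd and the choice of signal are illustrative):

#include <signal.h>
#include <string.h>
#include <unistd.h>

static int fatal_fd = -1;   // opened once at startup, before any signal can arrive

static void fatal_handler(int sig, siginfo_t*, void*) {
    // write() is async-signal-safe; malloc, fopen and iostreams are not
    const char msg[] = "fatal signal caught\n";
    if (write(fatal_fd, msg, sizeof msg - 1) < 0) { /* nothing safe left to do */ }
    _exit(128 + sig);
}

static void install_handler() {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fatal_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
}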
I'm using zmq version 4.2.2. My program crashes because of a call to zmq_abort(), which calls abort(). According to the stack trace, if I understand correctly, zmq_abort() is called from src/socket_poller.cpp:54. However, that line is the beginning of the function definition:
zmq::socket_poller_t::~socket_poller_t ()
The function has no direct calls to zmq_abort() or to any assert macros that would call it, and there are not many asserts, or any direct calls to zmq_abort(), in the whole file either. However, the other lines in the stack trace do seem to match the source code on GitHub:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/socket_poller.cpp#L54
How does execution end up in zmq_abort()?
Beginning of stack trace:
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f12cc9d4700 (LWP 23680))]
(gdb) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f12ce123415 in __GI_abort () at abort.c:90
#2 0x00007f12ce8db9c9 in zmq::zmq_abort (errmsg_=<optimized out>) at ../zeromq-4.2.2/src/err.cpp:87
#3 0x00007f12ce918cbe in zmq::socket_poller_t::~socket_poller_t (this=0x7f12c8004150,
__in_chrg=<optimized out>) at ../zeromq-4.2.2/src/socket_poller.cpp:54
#4 0x00007f12ce91793a in zmq_poller_destroy (poller_p_=0x7f12cc9d2af8)
at ../zeromq-4.2.2/src/zmq.cpp:1236
#5 0x00007f12ce917e14 in zmq_poller_poll (timeout_=<optimized out>, nitems_=2, items_=0x1)
at ../zeromq-4.2.2/src/zmq.cpp:854
#6 zmq_poll (items_=items_@entry=0x7f12cc9d2c20, nitems_=nitems_@entry=2, timeout_=timeout_@entry=5000)
at ../zeromq-4.2.2/src/zmq.cpp:866
zmq_abort() was called from an assertion macro in signaler_t's destructor:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/signaler.cpp#L143. The signaler_t object is a member of socket_poller_t. I don't know for sure why the call to the destructor is not shown in the stack trace.
I was trying not to ask (directly) what was wrong with my code, because it was infeasible to provide a code sample, but I'll mention that it turned out that a file descriptor was erroneously closed twice in another thread. Between the two close operations, zmq_poll() created a socket_poller_t object. signaler_t's constructor opened an eventfd, which received the same fd (number) that had been closed earlier. Then the other thread closed the same fd again, leading the destructor to get EBADF on close() and call zmq_abort().
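The underlying hazard is easy to demonstrate without ZeroMQ, because the kernel recycles fd numbers. A minimal single-threaded illustration (the assert holds on Linux as long as nothing opens another descriptor in between):

#include <cassert>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int a = open("/dev/null", O_RDONLY);
    close(a);                             // first (legitimate) close
    int b = open("/dev/null", O_RDONLY);  // kernel hands out the lowest free number again
    assert(b == a);                       // same number, different owner
    close(a);                             // stale second close silently destroys b
    // a later close(b) now fails with EBADF - exactly what zmq's assertion caught
    return 0;
}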
My C++ program is throwing this:
terminate called after throwing an instance of 'St9bad_alloc' what(): std::bad_alloc
It appears to be thrown from new, but the stack trace doesn't show any calls to new:
#0 0x0000003174a330c5 in raise () from /lib64/libc.so.6
#1 0x0000003174a34a76 in abort () from /lib64/libc.so.6
#2 0x00007f93b1b7b0b4 in __gnu_cxx::__verbose_terminate_handler ()
at ../../../../gcc-4.3.4/libstdc++-v3/libsupc++/vterminate.cc:98
#3 0x00007f93b1b794f6 in __cxxabiv1::__terminate (handler=0x522b)
at ../../../../gcc-4.3.4/libstdc++-v3/libsupc++/eh_terminate.cc:43
#4 0x00007f93b1b79523 in std::terminate ()
at ../../../../gcc-4.3.4/libstdc++-v3/libsupc++/eh_terminate.cc:53
#5 0x00007f93b1b79536 in __cxxabiv1::__unexpected (handler=0x522b)
at ../../../../gcc-4.3.4/libstdc++-v3/libsupc++/eh_terminate.cc:59
#6 0x00007f93b1b78ec8 in __cxxabiv1::__cxa_call_unexpected (exc_obj_in=0x7f93b1dae770)
at ../../../../gcc-4.3.4/libstdc++-v3/libsupc++/eh_personality.cc:750
#7 0x00007f93b2c356e0 in network::HttpLoader::doLoad (this=0x7f938801ef20) at loaders/HttpLoader.cxx:1071
#8 0x00007f93b2c70971 in network::Loader::load (this=0x522b) at Loader.cxx:899
#9 0x00007f93b2c74a15 in network::Loader::load2 (this=0x522b) at Loader.cxx:925
#10 0x00007f93b2c7b13a in network::LoaderThread::run() ()
#11 0x00007f93b1e60be4 in threads::Thread_startWorker (thr=0x7f938801e460) at Threads.cxx:479
#12 0x00007f93b1e60ead in threads::ThreadPool::run (this=0x1140478, thr=0x7f938801eeb0) at Threads.cxx:727
#13 0x00007f93b1e608e8 in threads::__Thread_startWorker (param=<value optimized out>) at Threads.cxx:520
#14 0x0000003175206ccb in start_thread () from /lib64/libpthread.so.0
#15 0x0000003174ae0c2d in clone () from /lib64/libc.so.6
I added debugging statements at the beginning of doLoad(), but it never gets to that point.
Stumped!
Any thoughts?
The new call may not be in the stack because the stack has already unwound by the time your application terminates. I'd set a breakpoint at the moment the exception is thrown (e.g., using catch throw under gdb) -- at that point you'll see the cause of the exception in the stack.
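For example:
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) run
...
(gdb) backtrace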
Maybe the new call was inlined due to optimizations. Try disabling optimizations, at least in loaders/HttpLoader.cxx.
I have a backtrace with something I haven't seen before. See frame 2 in these threads:
Thread 31 (process 8752):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0b139 in sigprocmask () from /lib/libc.so.6
#2 0x00b0c7a2 in abort () from /lib/libc.so.6
#3 0x00752aa0 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x00750505 in ?? () from /usr/lib/libstdc++.so.6
#5 0x00750542 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x00750c65 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7 0x00299c63 in ApplicationFunction()
Thread 1 (process 8749):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0ad80 in raise () from /lib/libc.so.6
#2 0x00b0c691 in abort () from /lib/libc.so.6
#3 0x00b4324b in __libc_message () from /lib/libc.so.6
#4 0x00b495b6 in malloc_consolidate () from /lib/libc.so.6
#5 0x00b4b3bd in _int_malloc () from /lib/libc.so.6
#6 0x00b4d3ab in malloc () from /lib/libc.so.6
#7 0x08147f03 in AnotherApplicationFunction ()
When I open the core with gdb and get a backtrace, it gives me thread 1. Only later did I see the weird state thread 31 is in. That thread is from the library we had problems with, so I believe the crash is caused by it.
So what does it mean? Two threads simultaneously doing something illegal? Or one of them somehow causing abort() in the other?
The OS is Red Hat Enterprise Linux 5.3, on a multiprocessor server.
It is hard to be sure, but my first suspicion on seeing these stack traces is memory corruption (possibly a buffer overrun on the heap). If that's the case, the corruption is probably the root cause of both threads ending up in abort().
Can you valgrind your app?
Looks like it could be heap corruption, detected by malloc in thread 1, either causing or caused by the error in thread 31.
Some broken piece of code overwriting, among other things, the vtable in thread 31 could easily cause this.
It's possible that thread 31 aborted because it trashed the application heap in some way. Then, when the main thread tried to allocate memory, the heap data structures were in a bad state, causing the allocation to fail and abort the application again.
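For illustration, the classic pattern behind this signature is a write past the end of a heap block (hypothetical sketch):

#include <cstring>

void corrupt_heap() {
    char* buf = new char[16];
    // 32 > 16: tramples the next chunk's malloc metadata (or an adjacent
    // object's vtable pointer); the abort then fires on some later
    // malloc/free, possibly in a completely different thread
    std::memset(buf, 0, 32);
    delete[] buf;
}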