Boost threads coring on startup - c++

I have a program that brings up and tears down multiple threads throughout its life. Everything works great for a while, but eventually I get the following core dump stack trace.
#0 0x009887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x007617a5 in raise () from /lib/tls/libc.so.6
#2 0x00763209 in abort () from /lib/tls/libc.so.6
#3 0x003ec1bb in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x003e9ed1 in __cxa_call_unexpected () from /usr/lib/libstdc++.so.6
#5 0x003e9f06 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x003ea04f in __cxa_throw () from /usr/lib/libstdc++.so.6
#7 0x00d5562b in boost::thread::start_thread () from /h/Program/bin/../lib/libboost_thread-gcc34-mt-1_39.so.1.39.0
At first, I was leaking threads and figured the core was due to hitting some maximum limit on the number of concurrent threads, but now it seems this problem occurs even when I don't leak. For reference, in the core above there were 13 active threads executing.
I did some searching to try to figure out why start_thread would core, but I didn't come across anything. Does anyone have any ideas?

start_thread is throwing an uncaught exception. Check which exceptions start_thread can throw and place a catch around the thread creation to see what the problem is.
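For example, a minimal sketch of wrapping the thread creation so the exception becomes visible instead of escaping and calling std::terminate() (worker is an illustrative stand-in for your thread body):

#include <boost/thread.hpp>
#include <iostream>

void worker() { /* thread body */ }

void spawn()
{
    try {
        boost::thread t(worker);
        t.detach();                      // or keep the handle and join() later
    } catch (const std::exception& e) {  // boost::thread_resource_error derives from std::exception
        std::cerr << "thread creation failed: " << e.what() << std::endl;
    }
}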

What are the values carried by thread_resource_error? It looks like you can call native_error() to find out.
Since this is a wrapper around pthreads, there are only a couple of possibilities: EAGAIN, EINVAL and EPERM. It looks as if Boost has exceptions it would likely throw for EINVAL and EPERM, i.e. unsupported_thread_option() and thread_permission_error().
That pretty much leaves EAGAIN, so I would double-check that you really aren't exceeding the system limits on the number of threads. Are you sure you are joining them, or, if detached, that they are really gone?
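A sketch of checking the native error code against those values, assuming native_error() is available and populated by your Boost version (worker again stands in for your thread body):

#include <boost/thread.hpp>
#include <cerrno>
#include <iostream>

void worker() { /* thread body */ }

void spawn_checked()
{
    try {
        boost::thread t(worker);
        t.join();
    } catch (const boost::thread_resource_error& e) {
        // Assumption: native_error() carries the errno from pthread_create() here.
        switch (e.native_error()) {
        case EAGAIN: std::cerr << "EAGAIN: system thread limit reached\n"; break;
        case EINVAL: std::cerr << "EINVAL: invalid thread attributes\n";   break;
        case EPERM:  std::cerr << "EPERM: insufficient permissions\n";     break;
        default:     std::cerr << "thread_resource_error: " << e.what() << '\n';
        }
    }
}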

Related

What is __lll_lock_wait_private and what can cause a hang while malloc_consolidate is called?

I have used 2 threads, but they are getting stuck with the following stack traces:
Thread 2:
(gdb) bt
#0 0x00007f9e1d7625bc in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f9e1d6deb35 in _L_lock_17166 () from /lib64/libc.so.6
#2 0x00007f9e1d6dbb73 in malloc () from /lib64/libc.so.6
#3 0x00007f9e1d6c4bad in __fopen_internal () from /lib64/libc.so.6
#4 0x00007f9e1dda2210 in std::__basic_file<char>::open(char const*, std::_Ios_Openmode, int) () from /lib64/libstdc++.so.6
#5 0x00007f9e1dddd5ba in std::basic_filebuf<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode) () from /lib64/libstdc++.so.6
#6 0x00000000005e1244 in fatalSignalHandler(int, siginfo*, void*) ()
#7 <signal handler called>
#8 0x00007f9e1d6d6839 in malloc_consolidate () from /lib64/libc.so.6
#9 0x00007f9e1d6d759e in _int_free () from /lib64/libc.so.6
_int_free is getting called as a result of a default destructor.
Thread 1:
(gdb) bt
#0 0x00007f9e2a4ed54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f9e2a4e8e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007f9e2a4e8d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
From the related question "Threads getting stuck with few threads at point in __lll_lock_wait" I learned that __lll_lock_wait() is called when we cannot get a lock on a mutex, because something else (in this case, I guess, Thread 2) is still holding it.
But Thread 2 is also stuck with the stack trace shown above, and since the binaries have no debug symbols, I can't check who owns the mutex. So my questions are:
What is the purpose of, and what causes, __lll_lock_wait_private()?
Is there any hint as to what and where the issue could be, without debug symbols available?
Several times I have seen hangs in malloc_consolidate() on Linux. Is this a well-known, still unresolved issue?
Frames 6 and 7 of thread 2 suggest a custom signal handler was installed. Frame 5 suggests it is trying to do something like write to a file (std::ofstream?).
That is not allowed. Very little is allowed in signal handlers, and definitely not iostreams.
Suppose you are in a function like malloc_consolidate which may have to touch the global arena, and take a lock to do it, and a signal comes along. If you allocate memory in the signal handler, you also need the same lock, which is already being held. Thread 2 is deadlocking itself.
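A minimal sketch of what an async-signal-safe handler could look like instead: it only uses write() and _exit(), which are on POSIX's async-signal-safe list, and never allocates memory or touches iostreams (the handler name and message are illustrative):

#include <signal.h>
#include <string.h>
#include <unistd.h>

void fatal_signal_handler(int sig, siginfo_t*, void*)
{
    // No allocation, no locks, no iostreams in here.
    const char msg[] = "fatal signal received\n";
    ssize_t rc = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)rc;
    _exit(128 + sig);   // _exit() is async-signal-safe; exit() is not
}

int install_handler()
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fatal_signal_handler;
    sa.sa_flags = SA_SIGINFO;
    return sigaction(SIGSEGV, &sa, 0);
}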

All threads in wait in core dump file, but someone triggered SIGABRT

I am supporting an application written in C++ over many years, and as of late it has started to crash, producing core dumps that we don't know how to handle.
It runs on an appliance on Ubuntu 14.04.5
When loading the core file in GDB, it says:
Program terminated with signal SIGABRT, Aborted
I can inspect 230 threads, but they are all in wait() at the exact same memory position.
There is a thread with ID 1 that in theory could be the responsible one, but that thread is also in wait.
So I have two questions basically.
How does the ID index of the threads work?
Is the thread with GDB ID 1 the last active thread, or is that an arbitrary index, meaning the failure could be in any of the other threads?
How can all threads be in wait() when a SIGABRT is triggered?
Shouldn't the instruction pointer be at the failing instruction when the OS decided to step in and halt the process? Or is this some sort of deadlock protection?
Any help much appreciated.
Backtrace of thread 1:
#0 0xf771dcd9 in ?? ()
#1 0xf74ad4ca in _int_free (av=0x38663364, p=<optimized out>,have_lock=-186161432) at malloc.c:3989
#2 0xf76b41ab in std::string::_Rep::_M_destroy(std::allocator<char> const&) () from /usr/lib32/libstdc++.so.6
#3 0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#4 0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#5 0x5685e8b4 in SlimStringMapper::~SlimStringMapper() ()
#6 0x567d6bc3 in destroy ()
#7 0x566a40b4 in HttpProxy::getLogonCredentials(HttpClient*, HttpServerTransaction*, std::string const&, std::string const&, std::string&, std::string&) ()
#8 0x566a5d04 in HttpProxy::add_authorization_header(HttpClient*, HttpServerTransaction*, Hosts::Host*) ()
#9 0x566af97c in HttpProxy::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#10 0x566d597e in callOnClientRequest(HttpClient*, HttpServerTransaction*, FastHttpRequest*) ()
#11 0x566d169f in GateKeeper::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#12 0x566a2291 in HttpClientThread::run() ()
#13 0x5682e37c in wa_run_thread ()
#14 0xf76f6f72 in start_thread (arg=0xec65ab40) at pthread_create.c:312
#15 0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#16 0xec65ab40 in ?? ()
Another thread that should be in wait:
#0 0xf771dcd9 in ?? ()
#1 0x5682e37c in wa_run_thread ()
#2 0xf76f6f72 in start_thread (arg=0xf33bdb40) at pthread_create.c:312
#3 0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#4 0xf33bdb40 in ?? ()
Best regards
Jon
How can all threads be in wait() when a SIGABRT is triggered?
Is wait the POSIX function, or something from the run-time environment? Are you looking at a higher-level backtrace?
Anyway, there is an easy explanation why this can happen: SIGABRT was sent to the process, and not generated by a thread in a synchronous fashion. Perhaps a coworker sent the signal to create the coredump, after observing the deadlock, to collect evidence for future analysis?
How does the id index of the threads work? Is thread with GDB ID 1 the last active thread?
When the program is running under GDB, GDB numbers threads as it discovers them, so thread 1 is always the main thread.
But when loading a core dump, GDB discovers threads in the order in which the kernel saved them. The kernels that I have seen always save the thread which caused program termination first, so usually loading a core into GDB immediately gets you to the crash point without the need to switch threads.
How can all threads be in wait() when a SIGABRT is triggered?
One possibility is that you are not analyzing the core correctly. In particular, you need exact copies of the shared libraries that were in use at the time the core was produced, and that's unlikely to be the case when the application runs on an appliance and you are analyzing the core on your development machine. See this answer.
I just saw your question. My answer is not specific to your direct question, but suggests some ways to handle this kind of situation. Multi-threading depends entirely on the hardware and operating system of the machine, especially the memory and processors. More threads means more memory and more processor time slices are required. I doubt your machine has more than 100 processors, which is what it would take for 230 threads to run concurrently at the highest performance. To avoid this situation, the steps below may help.
Control the creation of threads and the number of threads running concurrently.
Increase the memory available to your application (check compiler options to increase memory for the application at run time, or have the OS allocate enough memory).
Set the guard size and stack size of each thread properly (the calculation depends on what your application's threads do and is a bit involved; please read some documentation).
Handle synchronized blocks properly to avoid any deadlock (see the sketch after this answer).
Where necessary, use condition variables, try-locks, etc.
As you said, most of your threads are in a wait state, which means they are waiting for a lock to be released; that in turn means one of the threads has already acquired the lock and is either still busy processing or is probably in a deadlock situation.
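As an illustration of the deadlock-avoidance advice above, here is a minimal sketch (the mutexes and function are illustrative, not from the original program) of taking two locks without risking a lock-ordering deadlock:

#include <mutex>

std::mutex m1, m2;

void update_shared_state()
{
    // std::lock acquires both mutexes with a deadlock-avoidance algorithm,
    // so two threads taking them in opposite orders cannot deadlock each other.
    std::lock(m1, m2);
    std::lock_guard<std::mutex> g1(m1, std::adopt_lock);
    std::lock_guard<std::mutex> g2(m2, std::adopt_lock);
    // ... touch the state guarded by m1 and m2 ...
}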

What are the possible reasons for POSIX SIGBUS?

My program recently crashed with the following stack trace:
Program terminated with signal 7, Bus error.
#0 0x00007f0f323beb55 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f0f323beb55 in raise () from /lib64/libc.so.6
#1 0x00007f0f35f8042e in skgesigOSCrash () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#2 0x00007f0f36222ca9 in kpeDbgSignalHandler () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#3 0x00007f0f35f8063e in skgesig_sigactionHandler () from /usr/lib/oracle/11.2/client64/lib/libclntsh.so.11.1
#4 <signal handler called>
What should I check in my code to avoid this? Or is this something Oracle should fix?
The main reasons you could get a bus error revolve around inaccessible memory. This could be due to several causes:
Accessing through a deleted pointer.
Accessing through an uninitialized pointer.
Accessing through a NULL pointer.
Accessing an address that is not yours, for example due to overflow errors.
Try adding the following to the $ORACLE_HOME/network/admin/*.ora file:
DIAG_ADR_ENABLED=OFF
DIAG_SIGHANDLER_ENABLED=FALSE
DIAG_DDE_ENABLED=FALSE
This sounds like an Oracle issue. Oracle's libraries also appear to be compiled with Intel compilers.

C++ server crashes with abort() in _UTF8_init() on free()

I'm having problems with C++ code loaded via dlopen() by a C++ CGI server. After a while, the program crashes unexpectedly, but consistently at a memory management function call (such as free(), calloc(), etc.), and produces a core dump similar to this:
#0 0x0000000806b252dc in kill () from /lib/libc.so.6
#1 0x0000000804a1861e in raise () from /lib/libpthread.so.2
#2 0x0000000806b2416d in abort () from /lib/libc.so.6
#3 0x0000000806abdb45 in _UTF8_init () from /lib/libc.so.6
#4 0x0000000806abdfcc in _UTF8_init () from /lib/libc.so.6
#5 0x0000000806abeb1d in _UTF8_init () from /lib/libc.so.6
... the rest of the stack
Has anyone seen something like this before?
What is _UTF8_init() and why would memory management functions call it?
That smells like a corrupted heap, likely due to a buffer overrun somewhere in your code. Try running your program with Valgrind and look for any errors or warnings it emits.
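For reference, a contrived example (not from the original code) of the kind of heap buffer overrun Valgrind's memcheck reports as an "Invalid write"; the stray byte lands in the allocator's bookkeeping and often only triggers an abort much later, inside free() or malloc():

#include <cstdlib>
#include <cstring>

int main()
{
    char* buf = static_cast<char*>(std::malloc(8));
    std::strcpy(buf, "12345678");   // writes 9 bytes (8 chars + '\0') into an 8-byte block
    std::free(buf);                 // corruption may be detected here, or much later
    return 0;
}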

Simultaneous abort() in two threads

I have a backtrace with something I haven't seen before. See frame 2 in these threads:
Thread 31 (process 8752):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0b139 in sigprocmask () from /lib/libc.so.6
#2 0x00b0c7a2 in abort () from /lib/libc.so.6
#3 0x00752aa0 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x00750505 in ?? () from /usr/lib/libstdc++.so.6
#5 0x00750542 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x00750c65 in __cxa_pure_virtual () from /usr/lib/libstdc++.so.6
#7 0x00299c63 in ApplicationFunction()
Thread 1 (process 8749):
#0 0x00faa410 in __kernel_vsyscall ()
#1 0x00b0ad80 in raise () from /lib/libc.so.6
#2 0x00b0c691 in abort () from /lib/libc.so.6
#3 0x00b4324b in __libc_message () from /lib/libc.so.6
#4 0x00b495b6 in malloc_consolidate () from /lib/libc.so.6
#5 0x00b4b3bd in _int_malloc () from /lib/libc.so.6
#6 0x00b4d3ab in malloc () from /lib/libc.so.6
#7 0x08147f03 in AnotherApplicationFunction ()
When I open the core with gdb and get a backtrace, it gives me thread 1. Later I noticed the weird state that thread 31 is in. That thread comes from the library we have had problems with, so I believe the crash is caused by it.
So what does this mean? Are two threads simultaneously doing something illegal? Or is it one of them somehow causing abort() in the other?
The OS is Red Hat Enterprise Linux 5.3, running on a multiprocessor server.
It is hard to be sure, but my first suspicion upon seeing these stack traces would be a memory corruption (possibly a buffer overrun on the heap). If that's the case, then the corruption is probably the root cause of both threads ending up in abort.
Can you valgrind your app?
Looks like it could be heap corruption, detected by malloc in thread 1, causing or caused by the error in thread 31.
Some broken piece of code overwriting, among other things, the vtable seen in thread 31 could easily cause this.
It's possible that the reason thread 31 aborted is because it trashed the application heap in some way. Then when the main thread tried to allocate memory the heap data structure was in a bad state, causing the allocation to fail and abort the application again.