Core dump in libc exit call - c++

I am seeing a core dump on Solaris during the exit procedure of my program. How can I debug and fix this kind of core dump?
(gdb) where
#0 0xff2cc0c0 in kill () from /usr/lib/libc.so.1
#1 0x0004dac0 in run_before_killed_handler (sig=11) at NdmpServer.cpp:1186
#2 signal handler called
#3 0xfee0ad50 in ?? ()
#4 0x00060a6c in proc_cleanup ()
#5 0xff2421ac in _exithandle () from /usr/lib/libc.so.1
#6 0xff2305d8 in exit () from /usr/lib/libc.so.1
#7 0x0003431c in _start ()

Your program apparently uses atexit(3C) to register an exit handler. The problem is occurring in that handler.
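To illustrate the point about atexit(3C), here is a minimal sketch of how a handler ends up being called from exit(), written defensively so that it does not touch a resource that is already gone. The names proc_cleanup_sketch, install_cleanup and g_scratch are hypothetical; the real proc_cleanup in the question is not shown, so this is only an assumption about its general shape.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical resource owned by the process and released at exit.
static std::FILE* g_scratch = nullptr;
static int g_cleanup_calls = 0;

// Exit handler (a stand-in for proc_cleanup). It checks the resource before
// touching it and clears the pointer, so a repeated call is harmless.
extern "C" void proc_cleanup_sketch() {
    ++g_cleanup_calls;
    if (g_scratch != nullptr) {
        std::fclose(g_scratch);
        g_scratch = nullptr;
    }
}

// Acquires the resource and registers the handler with atexit().
// Returns true if registration succeeded.
bool install_cleanup() {
    g_scratch = std::tmpfile();  // may be nullptr; the handler tolerates that
    return std::atexit(proc_cleanup_sketch) == 0;
}
```

Calling install_cleanup() once early in main() is enough; exit() (or returning from main) then runs the handler. If the real handler calls through a pointer or into a library that has already been torn down, a crash like frame #3 in the trace above is one plausible outcome.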

Without knowing the finer details of Solaris memory layouts, 0xfee0ad50 seems to be on the OS side. What OS call are you trying (and failing) to make in proc_cleanup?

Related

localtime() invokes __lll_lock_wait_private(), which makes the thread deadlock

I have seen many questions in which __lll_lock_wait_private() ends up in a deadlock, but their call stacks are different: their code was calling malloc(), free(), or fork().
In my case, I have a Log class that tries to print a log line, and that logging call makes my thread deadlock.
See the call stack below:
#0 0x000000fff4c47b9c in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000fff4bf0364 in __tz_convert () from /lib64/libc.so.6
#2 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
#3 0x000000fff5167188 in getTimeStr() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#4 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#5 0x000000fff5318c90 in DaemonCtrlServer::strtDaemon(daemonInfo&) () from /usr/sbin/sajet/sharedobj/libDaemonCtlServer.so
#6 0x000000fff531abc0 in DaemonCtrlServer::restrtDiedDaemon(daemonInfo&) () from /usr/sbin/sajet/sharedobj/libDaemonCtlServer.so
#7 0x000000fff531ae64 in DaemonCtrlServer::handleChildDeath() () from /usr/sbin/sajet/sharedobj/libDaemonCtlServer.so
#8 0x00000000100080ac in sj_initd::DaemonDeathHandler() ()
#9 0x000000001000b8f8 in sj_initd::SignalHandler(int) ()
#10 0x00000000100080e8 in sj_initd_SigHandler::sj_initdSigHandler(int) ()
#11 <signal handler called>
#12 0x000000fff4bedc00 in __offtime () from /lib64/libc.so.6
#13 0x000000fff4bf02a8 in __tz_convert () from /lib64/libc.so.6
#14 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
#15 0x000000fff5167188 in getTimeStr() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#16 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#17 0x000000fff52e868c in ConnectionOS::ProcessReadEvent() () from /usr/sbin/sajet/sharedobj/libconnV2.so
#18 0x000000fff52ef354 in ConnectionOSManager::ProcessConns(fd_set*, fd_set*) () from /usr/sbin/sajet/sharedobj/libconnV2.so
#19 0x000000fff52f0a0c in SocketsManager::ProcessFds(bool) () from /usr/sbin/sajet/sharedobj/libconnV2.so
#20 0x000000fff52c51b4 in EventReactorBase::IO (this=0x19a65e80) at EventReactorBase.cpp:361
#21 0x000000fff52c457c in EventReactorBase::React (this=0x19a65e80) at EventReactorBase.cpp:419
#22 0x000000fff52c10cc in Task::Run (this=0x19a65e30) at Task.cpp:222
#23 0x000000fff52c1218 in startTask (t=0x19a65e30) at Task.cpp:152
#24 0x000000001000a9c4 in TaskManager::Start() ()
#25 0x0000000010007538 in main ()
sj_init is a daemon that monitors the live status of other daemons in a system. When a daemon dies (which closes its connection with sj_init), sj_init tries to restart that daemon. strtDaemon() then tries to print a log line, which calls getTimeStr(), which internally ends up in __lll_lock_wait_private().
Edit:
Since localtime() is not thread-safe, I tried localtime_r() instead, but it also led the thread into a deadlock, even though according to its description localtime_r() is a thread-safe function. What am I doing wrong here?
The program blocks inside a signal handler, in __tz_convert() called from localtime(). That signal handler interrupted an earlier __tz_convert() call, also made from localtime(), which was presumably still holding the lock the handler is now waiting for.
#0 0x000000fff4c47b9c in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000fff4bf0364 in __tz_convert () from /lib64/libc.so.6
#2 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
...
#11 <signal handler called>
...
#13 0x000000fff4bf02a8 in __tz_convert () from /lib64/libc.so.6
#14 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
...
#25 0x0000000010007538 in main ()
localtime() does not seem to be reentrant.
A signal handler may only call a very specific set of (async-signal-safe) functions. Scroll down here: https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04_03
localtime() is not in this set.
Signal handlers should be short and simple.
You could set up a thread that does the formatting and logging, and have the signal handler feed it, for example via a pipe, with all the information needed to format and log. The functions needed to do so are in the set linked above.
#4 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#16 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
Notice that LogClass::logBegin appears in your call stack twice?
You have two core problems.
You are calling localtime from a signal handler. That is not allowed.
You are looking at the wrong thread's stack. This thread got stuck in a deadlock caused by other threads and then was interrupted, getting stuck in the deadlock again. It's a victim of the deadlock (twice!). The perpetrator is another thread.
If you're going to log from a signal handler, you need very tight control over all the code that runs in the signal handler, and you need to pass the "heavy lifting" off to another thread that can safely call non-reentrant functions.
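The pipe-based approach suggested in the answers can be sketched roughly as follows. This is a minimal illustration, not the poster's code: forward_signal, install_signal_pipe, drain_signal_log and g_sig_pipe are hypothetical names. The handler only calls write(), which is on the async-signal-safe list; the timestamp formatting happens later in normal context.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <ctime>
#include <string>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

static int g_sig_pipe[2] = {-1, -1};  // [0] read end, [1] write end

// Signal handler: only write() is called, which is async-signal-safe.
extern "C" void forward_signal(int sig) {
    const int saved_errno = errno;
    const unsigned char byte = static_cast<unsigned char>(sig);
    const ssize_t n = write(g_sig_pipe[1], &byte, 1);
    (void)n;  // nothing safe to do here if the pipe is full
    errno = saved_errno;
}

// Creates the non-blocking pipe (once) and installs the handler for `sig`.
bool install_signal_pipe(int sig) {
    if (g_sig_pipe[0] == -1) {
        if (pipe(g_sig_pipe) != 0) return false;
        for (int fd : g_sig_pipe) {
            const int flags = fcntl(fd, F_GETFL);
            if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) return false;
        }
    }
    struct sigaction sa = {};
    sa.sa_handler = forward_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(sig, &sa, nullptr) == 0;
}

// Runs in normal (non-handler) context: drains the pipe and formats one log
// line per pending signal. localtime_r() and strftime() are fine here.
std::vector<std::string> drain_signal_log() {
    std::vector<std::string> lines;
    unsigned char byte = 0;
    while (read(g_sig_pipe[0], &byte, 1) == 1) {
        const std::time_t now = std::time(nullptr);
        std::tm tm_buf = {};
        char stamp[32] = "unknown-time";
        if (localtime_r(&now, &tm_buf) != nullptr) {
            std::strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", &tm_buf);
        }
        lines.push_back(std::string(stamp) + " received signal " +
                        std::to_string(static_cast<int>(byte)));
    }
    return lines;
}
```

In a select()-based reactor like the one in the trace, the read end of the pipe could simply be added to the watched descriptors, so that work such as restarting a daemon and logging runs from the event loop rather than from the handler.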

How to determine reason of pthread_raise(sig=6) in core file with gdb

My app crashes sometimes and I can't find the cause. The app is multithreaded (QThread) and uses several QUdpSockets. I think it happens due to simultaneous access to a socket, but I don't know when or where.
Here are the results of bt on the core file:
#0 0x414596e1 in ?? ()
#1 0x412d731b in pthread_kill (thread=1649, signo=6) at signals.c:69
#2 0x412d76a0 in __pthread_raise (sig=6) at signals.c:200
#3 0x41459395 in ?? ()
#4 0x00000006 in ?? ()
#5 0x41546ff4 in ?? ()
#6 0xbd5fd8bc in ?? ()
#7 0x4145a87d in ?? ()
#8 0x00000006 in ?? ()
#9 0x00000020 in ?? ()
#10 0x00000000 in ?? ()
What is sig=6 and when is it emitted?
How can I determine the reason for this behavior?
How do I know which -dev libraries are missing (the ?? positions in the stack)?
Signal number 6 on Linux is SIGABRT. The fact that it's being raised with __pthread_raise() seems to indicate that the application has directly called abort() or hit a failed assert().
It's likely that the missing parts of your backtrace are in the Qt libraries, so try installing the debugging symbols for all of those.
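As a small illustration of the SIGABRT point, the sketch below installs a handler that merely records the signal number. The names record_signal, watch_sigabrt and g_last_signal are hypothetical. It uses raise() to deliver the signal so that the process keeps running; after a real abort() the process still terminates once the handler returns.

```cpp
#include <cassert>
#include <csignal>

// Last signal number seen by the handler; sig_atomic_t is safe to write
// from a signal handler.
static volatile std::sig_atomic_t g_last_signal = 0;

// Handler: records the signal number and returns.
extern "C" void record_signal(int sig) {
    g_last_signal = sig;
}

// Installs the handler for SIGABRT; returns true on success.
bool watch_sigabrt() {
    return std::signal(SIGABRT, record_signal) != SIG_ERR;
}
```

A breakpoint set on abort in gdb serves a similar purpose when the crash can be reproduced interactively.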

Strange stack of a thread

My application crashes when it stops. gdb shows the following stack (the app is built with -g -O0):
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007f254ea99700 in ?? ()
#2 0x0000000000000000 in ?? ()
A short investigation showed that the crash happens while stopping a thread that is started the same way as many others in the app:
// mListener is std::thread and member of class UA
std::thread thr(&UA::run, this);
mListener = std::move(thr);
Then I ran gdb on the app before stopping it and saw a difference between the stack of the thread that causes the crash and the stacks of the other threads.
All the other threads look like this:
...
#8 0x000000000043a70a in std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (UI::Keyboard::*)()> (UI::Keyboard*)> >::_M_run() (this=0xa88fd0)
at /usr/include/c++/4.9/thread:115
#9 0x00007fb6055c3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007fb6083ff0a4 in start_thread (arg=0x7fb604042700) at pthread_create.c:309
#11 0x00007fb604d3304d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
But the 'wrong' thread always looks different:
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x000000000043317d in Semaphore::wait (this=0x7fb5fc0009e8) at /home/vadius/workspace/iPhone/core/src/Core/env/Semaphore.h:28
#2 0x0000000000432564 in SIP::UA::run (this=0x7fb5fc000980) at /home/vadius/workspace/iPhone/core/src/SIP/UA.cpp:132
#3 0x0000000000000000 in ?? ()
I assume that when the thread returns from its worker method (SIP::UA::run), it jumps to code located at nullptr.
My questions are:
1. Am I right that the stack of the 'bad' thread is wrong?
2. What can be the reason for such behavior, and how can I avoid it?
Debian jessie x64 /
GCC 4.9 /
Compile flags: set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -DDEBUG -g -O0")
The giveaway is the "address". 0x432564 reads as "C%d" when its bytes are interpreted as ASCII. The bytes in a normal address are usually not all ASCII. This is a stack buffer overflow.
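The ASCII check described above can be done by hand, or with a small helper like the following sketch. The function name ascii_view is hypothetical. It reads the bytes from most significant to least significant, which matches how the answer reads 0x432564; in little-endian memory the same bytes would appear in the opposite order.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Skips leading zero bytes, then returns the remaining bytes of `value`,
// most significant first, as text if every one of them is printable ASCII
// (0x20..0x7e). Returns "" otherwise, or if the value is zero.
std::string ascii_view(std::uint64_t value) {
    std::string text;
    bool started = false;
    for (int shift = 56; shift >= 0; shift -= 8) {
        const unsigned char byte = static_cast<unsigned char>((value >> shift) & 0xff);
        if (!started && byte == 0) continue;  // skip leading zero bytes
        started = true;
        if (byte < 0x20 || byte > 0x7e) return "";
        text.push_back(static_cast<char>(byte));
    }
    return text;
}
```

With this helper, ascii_view(0x432564) yields "C%d", while a typical shared-library address such as 0x00007fb6055c3970 from the healthy threads yields an empty string.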

OpenSSL crashes while calling SSL_new() Library function

I am working with the OpenSSL library. When I run the project, it crashes at this line of the source code:
m_pSslFd = SSL_new(m_pCtx);
The declaration and initialization are correct. Execution works fine the first time this library function is called, but it crashes the second time it is called.
Here is the gdb backtrace for this crash:
(gdb) bt
#0 0x0000003dee876285 in malloc_consolidate () from /lib64/libc.so.6
#1 0x0000003dee879415 in _int_malloc () from /lib64/libc.so.6
#2 0x0000003dee87a9a1 in malloc () from /lib64/libc.so.6
#3 0x00000032c1c6abee in CRYPTO_malloc () from /usr/lib64/libcrypto.so.10
#4 0x00000032c202986a in ssl3_new () from /usr/lib64/libssl.so.10
#5 0x00000032c203bfae in dtls1_new () from /usr/lib64/libssl.so.10
#6 0x00000032c204534c in SSL_new () from /usr/lib64/libssl.so.10
#7 0x00007ffff7882bf7 in DTLSCore::DoDTLSClientNegotiation (this=0x858940, iFd=#0x7fff635fd3bc, speer=...)at src/afg/DTLSCore.cpp:236
Any suggestions would be helpful. Thank you.

Infinite abort() in a backtrace of a C++ program core dump

I have a strange problem that I can't solve. Please help!
The program is a multithreaded C++ application that runs on an ARM Linux machine. Recently I began testing it over long runs, and sometimes it crashes after 1-2 days like so:
*** glibc detected ** /root/client/my_program: free(): invalid pointer: 0x002a9408 ***
When I open the core dump, I see that the main thread seems to have a corrupt stack: all I can see is infinite abort() calls.
GNU gdb (GDB) 7.3
...
This GDB was configured as "--host=i686 --target=arm-linux".
[New LWP 706]
[New LWP 700]
[New LWP 702]
[New LWP 703]
[New LWP 704]
[New LWP 705]
Core was generated by `/root/client/my_program'.
Program terminated with signal 6, Aborted.
#0 0x001c44d4 in raise ()
(gdb) bt
#0 0x001c44d4 in raise ()
#1 0x001c47e0 in abort ()
#2 0x001c47e0 in abort ()
#3 0x001c47e0 in abort ()
#4 0x001c47e0 in abort ()
#5 0x001c47e0 in abort ()
#6 0x001c47e0 in abort ()
#7 0x001c47e0 in abort ()
#8 0x001c47e0 in abort ()
#9 0x001c47e0 in abort ()
#10 0x001c47e0 in abort ()
#11 0x001c47e0 in abort ()
And it goes on and on. I tried to get to the bottom of it by moving up the stack to frame 3000 or even further, but eventually the core dump runs out of frames and I still can't see why this happened.
When I examine the other threads everything seems normal there.
(gdb) info threads
Id Target Id Frame
6 LWP 705 0x00132f04 in nanosleep ()
5 LWP 704 0x001e7a70 in select ()
4 LWP 703 0x00132f04 in nanosleep ()
3 LWP 702 0x00132318 in sem_wait ()
2 LWP 700 0x00132f04 in nanosleep ()
* 1 LWP 706 0x001c44d4 in raise ()
(gdb) thread 5
[Switching to thread 5 (LWP 704)]
#0 0x001e7a70 in select ()
(gdb) bt
#0 0x001e7a70 in select ()
#1 0x00057ad4 in CSerialPort::read (this=0xbea7d98c, string_buffer=..., delimiter=..., timeout_ms=1000) at CSerialPort.cpp:202
#2 0x00070de4 in CScanner::readResponse (this=0xbea7d4cc, resp_recv=..., timeout=1000, delim=...) at PidScanner.cpp:657
#3 0x00071198 in CScanner::sendExpect (this=0xbea7d4cc, cmd=..., exp_str=..., rcv_str=..., timeout=1000) at PidScanner.cpp:604
#4 0x00071d48 in CScanner::pollPid (this=0xbea7d4cc, mode=1, pid=12, pid_str=...) at PidScanner.cpp:525
#5 0x00072ce0 in CScanner::poll1 (this=0xbea7d4cc)
#6 0x00074c78 in CScanner::Poll (this=0xbea7d4cc)
#7 0x00089edc in CThread5::Thread5Poll (this=0xbea7d360)
#8 0x0008c140 in CThread5::run (this=0xbea7d360)
#9 0x00088698 in CThread::threadFunc (p=0xbea7d360)
#10 0x0012e6a0 in start_thread ()
#11 0x001e90e8 in clone ()
#12 0x001e90e8 in clone ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(Class and function names are a bit weird because I changed them.)
So thread #1 is where the stack is corrupt; the backtrace of every other thread (2-6) shows
Backtrace stopped: previous frame identical to this frame (corrupt stack?).
This happens because threads 2-6 are created in thread #1.
The thing is that I can't run the program under gdb because it runs on an embedded system, and I can't use a remote gdb server. The only option is examining core dumps, which do not occur very often.
Could you please suggest something that could move me forward with this? (Maybe something else I can extract from the core dump, or some way to put hooks in the code to catch the abort() call.)
UPDATE: Basile Starynkevitch suggested using Valgrind, but it turns out it has only been ported to ARMv7. I have an ARM 926, which is ARMv5, so this won't work for me. There are some efforts to compile Valgrind for ARMv5, though: "Valgrind cross compilation for ARMv5tel" and "valgrind on the ARM9".
UPDATE 2: I couldn't make Electric Fence work with my program. The program uses C++ and pthreads. The version of Efence I got, 2.1.13, crashed in an arbitrary place after I started a thread and tried to do something more or less complicated (for example, putting a value into an STL vector). I saw people mentioning some patches for Efence on the web but didn't have time to try them. I tried this on my Linux PC, not on the ARM, and other tools like Valgrind or Dmalloc don't report any problems with the code. So anyone using version 2.1.13 of Efence should be prepared for problems with pthreads (or maybe pthreads + C++ + STL, I don't know).
My guess for the "infinite" aborts is that either abort() causes a loop (e.g. abort -> signal handler -> abort -> ...) or that gdb can't correctly interpret the frames on the stack.
In either case I would suggest manually checking out the stack of the problematic thread. If abort causes a loop, you should see a pattern or at least the return address of abort repeating every so often. Perhaps you can then more easily find the root of the problem by manually skipping large parts of the (repeating) stack.
Otherwise, you should find that there is no repeating pattern and hopefully the return address of the failing function somewhere on the stack. In the worst case such addresses are overwritten due to a buffer overflow or such, but perhaps then you can still get lucky and recognise what it is overwritten with.
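The manual stack inspection described above can be partly automated once the raw stack words have been copied out of the core (in gdb, something like x/256xw $sp dumps them). The sketch below counts how often each word that falls inside the program's code range occurs; count_code_addresses is a hypothetical helper, and the code range passed to it is an assumption that would have to be taken from the actual binary (for example from gdb's info files).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Counts how often each stack word that falls inside [text_lo, text_hi)
// occurs. Such words are candidate return addresses; a value that repeats
// many times hints at a loop such as abort -> handler -> abort.
std::map<std::uint32_t, int> count_code_addresses(
        const std::vector<std::uint32_t>& stack_words,
        std::uint32_t text_lo, std::uint32_t text_hi) {
    std::map<std::uint32_t, int> counts;
    for (std::uint32_t word : stack_words) {
        if (word >= text_lo && word < text_hi) ++counts[word];
    }
    return counts;
}
```

Addresses that appear only once and are not the repeating abort() return address are the interesting ones: they may point at the function that was running before the abort.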
One possibility here is that something in that thread has very, very badly smashed the stack by vastly overwriting an on-stack data structure, destroying all the needed data on the stack in the process. That makes postmortem debugging very unpleasant.
If you can reproduce the problem at will, the right thing to do is to run the thread under gdb and watch precisely what is going on at the moment when the stack gets nuked. This may, in turn, require some sort of careful search to determine where exactly the error is happening.
If you cannot reproduce the problem at will, the best I can suggest is very carefully looking for clues in the thread local storage for that thread to see if it hints at where the thread was executing before death hit.