SIGSEGV on program exit with boost::log - c++

Some time ago we split our big project, which consisted almost entirely of static libraries, into many projects with dynamic libraries.
Since then we started seeing problems on shutdown.
Sometimes the process would not terminate. With gdb I found that a segfault occurs during object destruction, but the process is blocked in futex_wait.
I've since improved the code: global objects are now created inside functions instead of being global static data. That reduced the problem: it doesn't happen in my development environment anymore.
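The change described here, replacing namespace-scope static objects with function-local statics, can be sketched roughly as follows. `Registry` and `registry()` are hypothetical names used only for illustration; they are not taken from the project.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type standing in for one of the project's global objects.
class Registry {
public:
    void add(const std::string& name) { names_.push_back(name); }
    std::size_t size() const { return names_.size(); }

private:
    std::vector<std::string> names_;
};

// Before: `static Registry g_registry;` at namespace scope in a shared library.
// Its construction order relative to globals in other libraries is unspecified.

// After: the object is constructed on the first call (thread-safe since C++11)
// and destroyed during exit processing, in reverse order of construction.
Registry& registry() {
    static Registry instance;
    return instance;
}
```

This only controls when the object is constructed. It is still destroyed during exit processing, so code running in other static destructors or exit handlers may still touch it after it is gone, which may be why the problem was reduced rather than eliminated.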
However, in the test environment (rarely) and in the production environment (often) processes still get stuck on shutdown. So we need to restart the container manually, or have some kind of health check.
We are trying to reproduce this situation in a standalone Docker container running under Kubernetes, where the process runs under circusd, and we see the following:
#0 malloc_consolidate (av=0xf47fc400 <main_arena>) at malloc.c:4151
#1 0xf46ff1ab in _int_free (av=0xf47fc400 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:4057
#2 0xf48c6e68 in operator delete(void*) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#3 0xf52d173d in std::_Deque_base<boost::log::v2_mt_posix::record_view, std::allocator<boost::log::v2_mt_posix::record_view> >::~_Deque_base() () from /usr/local/lib/liblog.so.0
#4 0xf52d18b3 in std::deque<boost::log::v2_mt_posix::record_view, std::allocator<boost::log::v2_mt_posix::record_view> >::~deque() () from /usr/local/lib/liblog.so.0
#5 0xf52d1940 in boost::log::v2_mt_posix::sinks::bounded_fifo_queue<4000u, boost::log::v2_mt_posix::sinks::drop_on_overflow>::~bounded_fifo_queue() () from /usr/local/lib/liblog.so.0
#6 0xf52d462e in boost::log::v2_mt_posix::sinks::asynchronous_sink<cout_sink, boost::log::v2_mt_posix::sinks::bounded_fifo_queue<4000u, boost::log::v2_mt_posix::sinks::drop_on_overflow>
>::~asynchronous_sink() () from /usr/local/lib/liblog.so.0
#7 0xf52d47f4 in asynchronous_sink<cout_sink>::~asynchronous_sink() () from /usr/local/lib/liblog.so.0
#8 0xf52c199a in boost::detail::sp_counted_impl_pd<asynchronous_sink<cout_sink>*, boost::detail::sp_ms_deleter<asynchronous_sink<cout_sink> >
>::dispose() () from /usr/local/lib/liblog.so.0
#9 0xf51f3e7b in boost::log::v2_mt_posix::core::~core() () from /usr/lib/libboost_log.so.1.58.0
#10 0xf51f6529 in boost::detail::sp_counted_impl_p<boost::log::v2_mt_posix::core>::dispose() () from /usr/lib/libboost_log.so.1.58.0
#11 0xf51f6160 in boost::shared_ptr<boost::log::v2_mt_posix::core>::~shared_ptr() () from /usr/lib/libboost_log.so.1.58.0
#12 0xf46bcfb3 in __cxa_finalize (d=0xf526fa88) at cxa_finalize.c:56
#13 0xf51eaab3 in ?? () from /usr/lib/libboost_log.so.1.58.0
#14 0xf7769e2c in _dl_fini () at dl-fini.c:252
#15 0xf46bcc21 in __run_exit_handlers (status=status@entry=0, listp=0xf47fc3a4 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#16 0xf46bcc7d in __GI_exit (status=0) at exit.c:104
#17 0xf46a572b in __libc_start_main (main=0x8060dc0, argc=5, argv=0xffdd1514, init=0x8088090, fini=0x8088100, rtld_fini=0xf7769c50 <_dl_fini>, stack_end=0xffdd150c) at libc-start.c:321
#18 0x080630cc in ?? ()
I have no idea how to proceed from here. What is happening? Why do we get the segfault in boost::log::core destruction in this environment?
Does anyone have advice, perhaps based on experience, on how I can track this down?

Related

localtime() function is invoking ___lll_lock_wait_private() which is making the thread to go deadlock

I have seen many questions in which __lll_lock_wait_private() ends in a deadlock, but their call stacks are different: their code was calling malloc(), free() or fork().
In my case, I have a Log class that is trying to print a log message, and that call deadlocks my thread.
See the call stack below:
#0 0x000000fff4c47b9c in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000fff4bf0364 in __tz_convert () from /lib64/libc.so.6
#2 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
#3 0x000000fff5167188 in getTimeStr() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#4 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#5 0x000000fff5318c90 in DaemonCtrlServer::strtDaemon(daemonInfo&) () from /usr/sbin/sajet/sharedobj/libDaemonCtlServer.so
#6 0x000000fff531abc0 in DaemonCtrlServer::restrtDiedDaemon(daemonInfo&) () from /usr/sbin/sajet/sharedobj/libDaemonCtlServer.so
#7 0x000000fff531ae64 in DaemonCtrlServer::handleChildDeath() () from /usr/sbin/sajet/sharedobj/libDaemonCtlServer.so
#8 0x00000000100080ac in sj_initd::DaemonDeathHandler() ()
#9 0x000000001000b8f8 in sj_initd::SignalHandler(int) ()
#10 0x00000000100080e8 in sj_initd_SigHandler::sj_initdSigHandler(int) ()
#11 <signal handler called>
#12 0x000000fff4bedc00 in __offtime () from /lib64/libc.so.6
#13 0x000000fff4bf02a8 in __tz_convert () from /lib64/libc.so.6
#14 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
#15 0x000000fff5167188 in getTimeStr() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#16 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#17 0x000000fff52e868c in ConnectionOS::ProcessReadEvent() () from /usr/sbin/sajet/sharedobj/libconnV2.so
#18 0x000000fff52ef354 in ConnectionOSManager::ProcessConns(fd_set*, fd_set*) () from /usr/sbin/sajet/sharedobj/libconnV2.so
#19 0x000000fff52f0a0c in SocketsManager::ProcessFds(bool) () from /usr/sbin/sajet/sharedobj/libconnV2.so
#20 0x000000fff52c51b4 in EventReactorBase::IO (this=0x19a65e80) at EventReactorBase.cpp:361
#21 0x000000fff52c457c in EventReactorBase::React (this=0x19a65e80) at EventReactorBase.cpp:419
#22 0x000000fff52c10cc in Task::Run (this=0x19a65e30) at Task.cpp:222
#23 0x000000fff52c1218 in startTask (t=0x19a65e30) at Task.cpp:152
#24 0x000000001000a9c4 in TaskManager::Start() ()
#25 0x0000000010007538 in main ()
sj_init is a daemon that monitors the liveness of other daemons in the system. When a daemon dies (which closes its connection with sj_init), sj_init tries to restart that daemon. strtDaemon() then tries to print a log message, which calls getTimeStr(), which internally ends up in __lll_lock_wait_private().
Edit
Since localtime() is not thread-safe, I tried localtime_r() instead, but it also deadlocks the thread, even though according to its description localtime_r() is thread-safe. What am I doing wrong here?
The program blocks inside a signal handler, in __tz_convert() called from localtime(). The signal handler interrupted another __tz_convert() call, also made from localtime().
#0 0x000000fff4c47b9c in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000fff4bf0364 in __tz_convert () from /lib64/libc.so.6
#2 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
...
#11 <signal handler called>
...
#13 0x000000fff4bf02a8 in __tz_convert () from /lib64/libc.so.6
#14 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
...
#25 0x0000000010007538 in main ()
localtime() does not seem to be reentrant.
A signal handler may only call a very specific set of functions. Scroll down here: https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04_03
localtime() is not within this set.
Signal handlers should be short and simple.
You could set up a thread that does the formatting and logging, fed by the signal handler (for example via a pipe) with all the information needed to format and log. The functions needed to do so are within the set linked above.
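A minimal sketch of that pipe-based hand-off, assuming POSIX and C++11. The handler only calls write(), which is on the async-signal-safe list; the timestamp formatting happens on a separate thread. All names here (`on_signal`, `format_one_event`, `demo_signal_logging`) and the choice of SIGUSR1 are illustrative assumptions, not part of the original code.

```cpp
#include <cerrno>
#include <csignal>
#include <ctime>
#include <string>
#include <thread>
#include <unistd.h>

static int g_pipe_fds[2] = {-1, -1};  // [0]: read end (logger), [1]: write end (handler)

// Signal handler: async-signal-safe work only. It forwards the signal number.
extern "C" void on_signal(int signo) {
    const int saved_errno = errno;  // write() may clobber errno
    const unsigned char byte = static_cast<unsigned char>(signo);
    const ssize_t rc = write(g_pipe_fds[1], &byte, 1);
    (void)rc;  // nothing safe to do about a failure here
    errno = saved_errno;
}

// Runs on the logger thread: waits for one event, then formats it.
// localtime_r() is fine here because this is not signal-handler context.
std::string format_one_event() {
    unsigned char byte = 0;
    ssize_t n;
    do {
        n = read(g_pipe_fds[0], &byte, 1);
    } while (n < 0 && errno == EINTR);
    if (n != 1) return "";

    std::time_t now = std::time(nullptr);
    std::tm tm_buf{};
    localtime_r(&now, &tm_buf);
    char stamp[32];
    std::strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", &tm_buf);
    return std::string(stamp) + " got signal " + std::to_string(static_cast<int>(byte));
}

// Installs the handler, raises SIGUSR1 once, and returns the logger's line.
std::string demo_signal_logging() {
    if (pipe(g_pipe_fds) != 0) return "";

    struct sigaction sa{};
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGUSR1, &sa, nullptr);

    std::string line;
    std::thread logger([&line] { line = format_one_event(); });
    raise(SIGUSR1);  // the handler runs here and writes one byte to the pipe
    logger.join();

    close(g_pipe_fds[0]);
    close(g_pipe_fds[1]);
    return line;
}
```

In a real daemon the logger thread would loop over the pipe for the lifetime of the process, and the restart logic itself would also move out of the handler.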
#4 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#16 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
Notice that LogClass::logBegin appears in your call stack twice?
You have two core problems.
You are calling localtime from a signal handler. That is not allowed.
You are looking at the wrong thread's stack. This thread got stuck in a deadlock caused by other threads and then was interrupted, getting stuck in the deadlock again. It's a victim of the deadlock (twice!). The perpetrator is another thread.
If you're going to log from a signal handler, you need very tight control over all the code that runs in the signal handler and you need to pass the "heavy lifting" off to another thread that can safely call non-reentrant functions.
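One way to hand the heavy lifting to another thread without any asynchronous handler at all is to block the signal and receive it synchronously with sigwait() on a dedicated thread. The sketch below assumes POSIX threads and uses SIGUSR2 purely for demonstration; `timestamped` and `demo_sigwait_logging` are hypothetical names.

```cpp
#include <csignal>
#include <ctime>
#include <string>
#include <thread>
#include <pthread.h>
#include <unistd.h>

// Ordinary thread context, so localtime_r() and std::string are fine here.
std::string timestamped(const std::string& message) {
    std::time_t now = std::time(nullptr);
    std::tm tm_buf{};
    localtime_r(&now, &tm_buf);
    char stamp[32];
    std::strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", &tm_buf);
    return std::string(stamp) + " " + message;
}

// Blocks SIGUSR2, starts a worker that waits for it, sends the signal once,
// and returns the line the worker produced.
std::string demo_sigwait_logging() {
    // Block the signal before creating threads: new threads inherit the mask,
    // so the signal stays pending until some thread calls sigwait().
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR2);
    pthread_sigmask(SIG_BLOCK, &set, nullptr);

    std::string line;
    std::thread worker([&set, &line] {
        int signo = 0;
        if (sigwait(&set, &signo) == 0) {
            line = timestamped("received signal " + std::to_string(signo));
        }
    });

    kill(getpid(), SIGUSR2);  // process-directed; picked up by the worker
    worker.join();
    return line;
}
```

Because the worker is a normal thread, it can take locks, allocate, and call localtime_r() without the reentrancy problem visible in the backtrace.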

Core dump in zmq library in a multi-threaded application with optimized binary

This core dump in the zmq library happened in the field (not reproducible yet) with an optimized binary.
#0 0x00007f44a00801f7 in raise () from /lib64/libc.so.6
#1 0x00007f44a00818e8 in abort () from /lib64/libc.so.6
#2 0x00007f44a1f74759 in zmq::zmq_abort(char const*) () from /lib64/libzmq.so.5
#3 0x00007f44a1fa410d in zmq::tcp_write(int, void const*, unsigned long) () from /lib64/libzmq.so.5
#4 0x00007f44a1f9f417 in zmq::stream_engine_t::out_event() () from /lib64/libzmq.so.5
#5 0x00007f44a1f7437a in zmq::epoll_t::loop() () from /lib64/libzmq.so.5
#6 0x00007f44a1fa83a6 in thread_routine () from /lib64/libzmq.so.5
#7 0x00007f44a1b2ce25 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f44a014334d in clone () from /lib64/libc.so.6
While I am analyzing my application code, hoping to find some misuse of zmq (probably the same zmq socket being used by 2 different threads, or some other memory corruption), I would like to know what else I can get from this core dump.
For a start, I can see a total of 102 threads running at the time of the dump. Many of them are in epoll_wait.
#0 0x00007f44a0143923 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f44a1f74309 in zmq::epoll_t::loop() () from /lib64/libzmq.so.5
#2 0x00007f44a1fa83a6 in thread_routine () from /lib64/libzmq.so.5
#3 0x00007f44a1b2ce25 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f44a014334d in clone () from /lib64/libc.so.6
The other threads pointing to application code do not look suspicious yet.
The errno printed is 14 = EFAULT (Bad address).
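To make the meaning of EFAULT concrete: on Linux, a write-style system call fails with it when the buffer pointer refers to memory that is not mapped in the process. The sketch below is an illustration under that assumption and is not taken from libzmq; `errno_for_unmapped_buffer` is a hypothetical name, and a pipe stands in for the TCP socket.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Returns the errno reported by write() when its buffer has been unmapped,
// 0 if the write unexpectedly succeeded, or -1 if the setup failed.
int errno_for_unmapped_buffer() {
    int fds[2];
    if (pipe(fds) != 0) return -1;

    const std::size_t len = 4096;
    void* buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    munmap(buf, len);  // the buffer is gone; `buf` is now a dangling pointer

    errno = 0;
    const ssize_t n = write(fds[1], buf, 16);  // kernel cannot read from buf
    const int err = (n < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

If the same mechanism is at work in the dump, the message buffer handed to the I/O thread was probably released or corrupted before it was sent, so message ownership and lifetime in the application code may be worth reviewing alongside socket sharing between threads.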
Can I try to get anything from the disassembly? I have not debugged much disassembly in the past, but if it can give me any clue in this situation, I can dive in.
Any (other) advice/pointer will also be highly appreciated.
Thanks.

OpenSSL crashes while calling SSL_new() Library function

I am working with the OpenSSL library. When I run the project, it crashes at this line of the source code:
m_pSslFd = SSL_new(m_pCtx);
The declaration and initialization are correct. Execution works fine the first time this library function is called, but it crashes the second time it is called.
Here is the gdb backtrace for this crash:
(gdb) bt
#0 0x0000003dee876285 in malloc_consolidate () from /lib64/libc.so.6
#1 0x0000003dee879415 in _int_malloc () from /lib64/libc.so.6
#2 0x0000003dee87a9a1 in malloc () from /lib64/libc.so.6
#3 0x00000032c1c6abee in CRYPTO_malloc () from /usr/lib64/libcrypto.so.10
#4 0x00000032c202986a in ssl3_new () from /usr/lib64/libssl.so.10
#5 0x00000032c203bfae in dtls1_new () from /usr/lib64/libssl.so.10
#6 0x00000032c204534c in SSL_new () from /usr/lib64/libssl.so.10
#7 0x00007ffff7882bf7 in DTLSCore::DoDTLSClientNegotiation (this=0x858940, iFd=@0x7fff635fd3bc, speer=...) at src/afg/DTLSCore.cpp:236
Any suggestions would be helpful. Thank you.

gdb - get exactly unexpected error from core file

What exactly do I need to do to get the unexpected error code, or something similar, from the core file with GDB or some other tool, so I can get an idea of why my daemon died at operator new?
(gdb) bt
#0 0x48775bd7 in thr_kill () from /lib/libc.so.7
#1 0x48726f46 in pthread_kill () from /lib/libthr.so.3
#2 0x487245da in raise () from /lib/libthr.so.3
#3 0x4880abba in abort () from /lib/libc.so.7
#4 0x4866e65f in __gnu_cxx::__verbose_terminate_handler ()
from /usr/lib/libstdc++.so.6
#5 0x486729aa in std::set_unexpected () from /usr/lib/libstdc++.so.6
#6 0x486729f2 in std::terminate () from /usr/lib/libstdc++.so.6
#7 0x486728ea in __cxa_throw () from /usr/lib/libstdc++.so.6
#8 0x486c77ac in operator new () from /usr/lib/libstdc++.so.6
#9 0x0806ad4c in XXX::process_in (this=0x4b110d40,
map_settings_to_save=@0x7f7fcc98, str_answer=@0x7f7fcf84)
at Click.cpp:2940
Go to line 2940 of Click.cpp; you should find that someone is instantiating a new object. There was some error in the constructor.
From the looks of it, the heap in your application is trashed.
If the program used to work, and you changed something before this, potentially something that could have damaged the heap, check that carefully.
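In the backtrace, operator new calls __cxa_throw, which goes straight to std::terminate. With libstdc++ that is typically what an allocation failure (std::bad_alloc) with no matching handler looks like. A minimal sketch of catching the exception at the allocation site so the failure can be logged instead of aborting; `try_allocate` is a hypothetical helper, not code from Click.cpp.

```cpp
#include <cstddef>
#include <new>
#include <string>

// Attempts to allocate `bytes` (assumed > 0) and reports the outcome as text.
std::string try_allocate(std::size_t bytes) {
    try {
        void* p = ::operator new(bytes);
        // Touch the memory through a volatile pointer so the allocation is
        // not optimized away.
        static_cast<volatile char*>(p)[0] = 0;
        ::operator delete(p);
        return "ok";
    } catch (const std::bad_alloc& e) {
        // Reached when the allocator cannot satisfy the request.
        return std::string("allocation failed: ") + e.what();
    }
}
```

Logging the requested size in the catch block can help distinguish a corrupted size value from genuine memory exhaustion.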

QT5.2.0; Debian Wheezy: QSocketNotifier::type() segfault

I have a multithreaded app that uses QThreadPool. It crashes after a random amount of time (sometimes minutes, sometimes hours...) with a segfault. I recompiled with debugging symbols and ran through GDB. Here's the backtrace:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7231170 in QSocketNotifier::type() const ()
from /opt/Qt/5.2.0/gcc_64/lib/libQt5Core.so.5
(gdb) where
#0 0x00007ffff7231170 in QSocketNotifier::type() const ()
from /opt/Qt/5.2.0/gcc_64/lib/libQt5Core.so.5
#1 0x00007ffff724b732 in ?? () from /opt/Qt/5.2.0/gcc_64/lib/libQt5Core.so.5
#2 0x00007ffff51d713b in g_main_context_check ()
from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3 0x00007ffff51d75c2 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4 0x00007ffff51d7744 in g_main_context_iteration ()
from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007ffff724c023 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /opt/Qt/5.2.0/gcc_64/lib/libQt5Core.so.5
#6 0x00007ffff71fa2cb in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /opt/Qt/5.2.0/gcc_64/lib/libQt5Core.so.5
#7 0x00007ffff71fe33e in QCoreApplication::exec() ()
from /opt/Qt/5.2.0/gcc_64/lib/libQt5Core.so.5
#8 0x0000000000409bb9 in main (argc=1, argv=<optimized out>) at main.cpp:166
That's the complete backtrace. It references/mentions basically no code within the app itself; it all appears to be Qt library code causing the fault. Not sure what source from the app itself to include in this post since GDB does not reference anything within the app itself. Any ideas?