OpenSSL SSL_shutdown received signal SIGPIPE, Broken pipe - c++

I'm writing an HTTP/HTTPS client using openssl-0.9.8e.
I get an error when I call SSL_read(). SSL_get_error() then returns SSL_ERROR_SYSCALL, and errno is ECONNRESET (104, /* Connection reset by peer */).
According to the SSL documentation, this is what it means:
SSL_ERROR_SYSCALL
Some I/O error occurred. The OpenSSL error queue may contain more information on the error.
If the error queue is empty (i.e. ERR_get_error() returns 0), ret can be used to find out more about the
error: If ret == 0, an EOF was observed that violates the protocol. If ret == -1, the underlying BIO
reported an I/O error (for socket I/O on Unix systems, consult errno for details).
So the connection was reset. I then call SSL_shutdown() to close the connection, and the program receives signal SIGPIPE, Broken pipe.
I already call signal(SIGPIPE, SIG_IGN); to ignore SIGPIPE, but it doesn't seem to work.
A segmentation fault happens; here is the backtrace:
#0 0x00000032bd00d96b in write () from /lib64/libpthread.so.0
#1 0x0000003add478367 in ?? () from /lib64/libcrypto.so.6
#2 0x0000003add4766fe in BIO_write () from /lib64/libcrypto.so.6
#3 0x0000003add8208fd in ssl3_write_pending () from /lib64/libssl.so.6
#4 0x0000003add820d9a in ssl3_dispatch_alert () from /lib64/libssl.so.6
#5 0x0000003add81e982 in ssl3_shutdown () from /lib64/libssl.so.6
#6 0x00000000004565d0 in CWsPollUrl::SSLClear (this=<value optimized out>, ctx=0x2aaab804a1b0, ssl=0x2aaab804a680)
at ../src/Wspoll.cpp:1122
#7 0x00000000004575e0 in CWsPollUrl::asyncEventDelete (this=0x4d422e50, eev=0x2aaab8001160) at ../src/Wspoll.cpp:1546
#8 0x000000000045928a in CWsPollUrl::onFail (this=0x4d422e50, eev=0x2aaab8001160, errorCode=4) at ../src/Wspoll.cpp:1523
#9 0x000000000045ab17 in CWsPollUrl::handleData (this=0x4d422e50, eev=0x2aaab8001160, len=<value optimized out>) at ../src/Wspoll.cpp:1259
#10 0x000000000045abcc in CWsPollUrl::asyncRecvEvent (this=0x4d422e50, fd=<value optimized out>, eev=0x2aaab8001160)
at ../src/Wspoll.cpp:1211
#11 0x00000000004636b5 in event_base_loop (base=0x14768360, flags=0) at event.c:1350
#12 0x0000000000456a62 in CWsPollUrl::run (this=<value optimized out>, param=<value optimized out>) at ../src/Wspoll.cpp:461
#13 0x0000000000436c5c in doPollUrl (data=<value optimized out>, user_data=<value optimized out>) at ../src/PollStrategy.cpp:151
#14 0x00000032bf44a95d in ?? () from /lib64/libglib-2.0.so.0
#15 0x00000032bf448e04 in ?? () from /lib64/libglib-2.0.so.0
#16 0x00000032bd00677d in start_thread () from /lib64/libpthread.so.0
#17 0x00000032bc4d3c1d in clone () from /lib64/libc.so.6
Why do I still get a SIGPIPE signal when I have already called signal(SIGPIPE, SIG_IGN);?
Does anyone know why?
Thanks in advance.

If you get an I/O error from SSL_read, it makes little sense to call SSL_shutdown, because the shutdown tries to send a "close notify" alert to the peer, which obviously cannot work on a broken connection. That is why you get the SIGPIPE or EPIPE. Getting ECONNRESET from SSL_read in this case probably means that the peer closed the connection hard, i.e. without doing an SSL_shutdown of its own. You should not continue working with the socket after the error, not even to do an SSL_shutdown.
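To illustrate, here is a minimal sketch under those assumptions, with a blocking SSL* plus its socket fd; read_and_close is a made-up helper name, not anything from the asker's code. SSL_shutdown is only attempted when the transport is still intact.

#include <openssl/ssl.h>
#include <unistd.h>

// Sketch: read once, then tear the connection down according to the error class.
void read_and_close(SSL* ssl, int fd)
{
    char buf[4096];
    int n = SSL_read(ssl, buf, sizeof(buf));
    if (n > 0) {
        // ... handle application data, then shut down cleanly ...
        SSL_shutdown(ssl);
    } else {
        int err = SSL_get_error(ssl, n);
        if (err == SSL_ERROR_ZERO_RETURN) {
            // Peer sent a clean "close notify": answering with our own shutdown is fine.
            SSL_shutdown(ssl);
        }
        // SSL_ERROR_SYSCALL / SSL_ERROR_SSL: the connection is already broken,
        // so sending a "close notify" would only produce SIGPIPE/EPIPE; skip it.
    }
    SSL_free(ssl);
    close(fd);
}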

In addition to @SteffenUllrich's answer, you can also call SSL_get_shutdown() before calling SSL_shutdown() and check whether the SSL_SENT_SHUTDOWN flag has already been set. You can do something like this:
// Perform a mutex lock here
if (SSL_get_shutdown(ssl) & SSL_SENT_SHUTDOWN)
{
    // The "close notify" alert has already been sent on this SSL object; don't send it again.
    printf("shutdown request ignored\n");
}
else
{
    SSL_shutdown(ssl);
}
// Perform a mutex unlock here
In a multithreaded program where the SSL* pointer is shared among several threads, another thread may already have called SSL_shutdown, and this check protects you from a SIGPIPE signal.

Related

localtime() function is invoking __lll_lock_wait_private(), which is making the thread deadlock

I have seen many questions in which __lll_lock_wait_private() ends up deadlocked, but their call stacks are different: their code was calling malloc(), free() or fork().
In my case, I have a Log class that is trying to print a log message, and that log call is deadlocking my thread.
See the call stack below:
#0 0x000000fff4c47b9c in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000fff4bf0364 in __tz_convert () from /lib64/libc.so.6
#2 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
#3 0x000000fff5167188 in getTimeStr() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#4 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#5 0x000000fff5318c90 in DaemonCtrlServer::strtDaemon(daemonInfo&) () from /usr/sbin/sajet/sharedobj/libDaemonCtlServer.so
#6 0x000000fff531abc0 in DaemonCtrlServer::restrtDiedDaemon(daemonInfo&) () from /usr/sbin/sajet/sharedobj/libDaemonCtlServer.so
#7 0x000000fff531ae64 in DaemonCtrlServer::handleChildDeath() () from /usr/sbin/sajet/sharedobj/libDaemonCtlServer.so
#8 0x00000000100080ac in sj_initd::DaemonDeathHandler() ()
#9 0x000000001000b8f8 in sj_initd::SignalHandler(int) ()
#10 0x00000000100080e8 in sj_initd_SigHandler::sj_initdSigHandler(int) ()
#11 <signal handler called>
#12 0x000000fff4bedc00 in __offtime () from /lib64/libc.so.6
#13 0x000000fff4bf02a8 in __tz_convert () from /lib64/libc.so.6
#14 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
#15 0x000000fff5167188 in getTimeStr() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#16 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#17 0x000000fff52e868c in ConnectionOS::ProcessReadEvent() () from /usr/sbin/sajet/sharedobj/libconnV2.so
#18 0x000000fff52ef354 in ConnectionOSManager::ProcessConns(fd_set*, fd_set*) () from /usr/sbin/sajet/sharedobj/libconnV2.so
#19 0x000000fff52f0a0c in SocketsManager::ProcessFds(bool) () from /usr/sbin/sajet/sharedobj/libconnV2.so
#20 0x000000fff52c51b4 in EventReactorBase::IO (this=0x19a65e80) at EventReactorBase.cpp:361
#21 0x000000fff52c457c in EventReactorBase::React (this=0x19a65e80) at EventReactorBase.cpp:419
#22 0x000000fff52c10cc in Task::Run (this=0x19a65e30) at Task.cpp:222
#23 0x000000fff52c1218 in startTask (t=0x19a65e30) at Task.cpp:152
#24 0x000000001000a9c4 in TaskManager::Start() ()
#25 0x0000000010007538 in main ()
sj_init is a daemon that monitors the live status of the other daemons in the system. When a daemon dies (which closes its connection with sj_init), sj_init tries to restart it. startDaemon() then tries to print a log message, which calls getTimeStr(), which internally ends up in __lll_lock_wait_private().
Edit
Since localtime() is not thread-safe, I tried localtime_r(), but it also led the thread into a deadlock. According to its description, localtime_r() is a thread-safe function. What am I doing wrong here?
The program blocks inside a signal handler, in __tz_convert() called from localtime(). The signal handler itself interrupted __tz_convert() called from localtime():
#0 0x000000fff4c47b9c in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000fff4bf0364 in __tz_convert () from /lib64/libc.so.6
#2 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
...
#11 <signal handler called>
...
#13 0x000000fff4bf02a8 in __tz_convert () from /lib64/libc.so.6
#14 0x000000fff4bee2c0 in localtime () from /lib64/libc.so.6
...
#25 0x0000000010007538 in main ()
localtime() is not reentrant.
A signal handler may only call a very specific set of functions. Scroll down here: https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04_03
localtime() is not within this set.
A signal handler should be short and simple.
You could set up a dedicated thread that does the formatting and logging, and have the signal handler feed it all the necessary information, for example through a pipe. The functions needed for that are in the async-signal-safe set linked above.
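A minimal sketch of that idea follows; all names are invented for illustration, not taken from the asker's daemon. The handler only calls write(), which is async-signal-safe, and a dedicated thread does the localtime_r()/formatting work.

#include <csignal>
#include <cstdio>
#include <ctime>
#include <pthread.h>
#include <unistd.h>

static int log_pipe[2];  // [0] read end (logger thread), [1] write end (signal handler)

// Signal handler: restricted to async-signal-safe calls (here, write()).
extern "C" void on_child_death(int signo)
{
    unsigned char msg = static_cast<unsigned char>(signo);
    (void)write(log_pipe[1], &msg, 1);   // hand the event off to the logger thread
}

// Logger thread: free to call non-async-signal-safe functions such as localtime_r().
static void* logger_main(void*)
{
    unsigned char msg;
    while (read(log_pipe[0], &msg, 1) == 1) {
        time_t now = time(nullptr);
        struct tm tm_buf;
        localtime_r(&now, &tm_buf);
        char stamp[32];
        strftime(stamp, sizeof(stamp), "%F %T", &tm_buf);
        std::fprintf(stderr, "%s caught signal %d, restarting daemon...\n", stamp, msg);
        // ... do the actual restart work here, outside signal context ...
    }
    return nullptr;
}

int main()
{
    pipe(log_pipe);
    pthread_t logger;
    pthread_create(&logger, nullptr, logger_main, nullptr);
    std::signal(SIGCHLD, on_child_death);
    for (;;) pause();   // the real daemon's main loop goes here
}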
#4 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
#16 0x000000fff516756c in LogClass::logBegin() () from /usr/sbin/sajet/sharedobj/liblibLite.so
Notice that LogClass::logBegin appears in your call stack twice?
You have two core problems.
You are calling localtime from a signal handler. That is not allowed.
You are looking at the wrong thread's stack. This thread got stuck in a deadlock caused by other threads and then was interrupted, getting stuck in the deadlock again. It's a victim of the deadlock (twice!). The perpetrator is another thread.
If you're going to log from a signal handler, you need very tight control over all the code that runs in the handler, and you need to pass the "heavy lifting" off to another thread that can safely call non-reentrant functions.

Core dump in zmq library in a multi-threaded application with optimized binary

This core dump in the zmq library happened in the field (not yet reproducible) with an optimized binary.
#0 0x00007f44a00801f7 in raise () from /lib64/libc.so.6
#1 0x00007f44a00818e8 in abort () from /lib64/libc.so.6
#2 0x00007f44a1f74759 in zmq::zmq_abort(char const*) () from /lib64/libzmq.so.5
#3 0x00007f44a1fa410d in zmq::tcp_write(int, void const*, unsigned long) () from /lib64/libzmq.so.5
#4 0x00007f44a1f9f417 in zmq::stream_engine_t::out_event() () from /lib64/libzmq.so.5
#5 0x00007f44a1f7437a in zmq::epoll_t::loop() () from /lib64/libzmq.so.5
#6 0x00007f44a1fa83a6 in thread_routine () from /lib64/libzmq.so.5
#7 0x00007f44a1b2ce25 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f44a014334d in clone () from /lib64/libc.so.6
While I am analyzing my application code, hoping to find some misuse of zmq (probably the same zmq socket being used by two different threads, or some other memory corruption), I would like to know: what else can I get from this core dump?
For a start, I can see a total of 102 threads running at the time of the dump. Many of them are in epoll_wait:
#0 0x00007f44a0143923 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f44a1f74309 in zmq::epoll_t::loop() () from /lib64/libzmq.so.5
#2 0x00007f44a1fa83a6 in thread_routine () from /lib64/libzmq.so.5
#3 0x00007f44a1b2ce25 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f44a014334d in clone () from /lib64/libc.so.6
The other threads, which point into application code, do not look suspicious yet.
The errno printed is 14 = EFAULT (Bad address).
Can I get anything from the disassembly? I have not done much disassembly-level debugging in the past, but if it can give any clue in this situation, I am willing to jump in.
Any other advice or pointers would also be highly appreciated.
Thanks.

How to determine reason of pthread_raise(sig=6) in core file with gdb

My app crashes sometimes and I can't find the cause. It is multithreaded (QThread) and uses several QUdpSockets. I think it happens because of simultaneous access to a socket, but I don't know when or where.
Here is the result of bt on the core file:
#0 0x414596e1 in ?? ()
#1 0x412d731b in pthread_kill (thread=1649, signo=6) at signals.c:69
#2 0x412d76a0 in __pthread_raise (sig=6) at signals.c:200
#3 0x41459395 in ?? ()
#4 0x00000006 in ?? ()
#5 0x41546ff4 in ?? ()
#6 0xbd5fd8bc in ?? ()
#7 0x4145a87d in ?? ()
#8 0x00000006 in ?? ()
#9 0x00000020 in ?? ()
#10 0x00000000 in ?? ()
What is sig=6 and when is it emitted?
How can I determine the reason for this behavior?
How do I know which -dev libraries are missing (the ?? positions in the stack)?
Signal number 6 on Linux is SIGABRT. The fact that it is being raised via __pthread_raise() suggests that the application has called abort() directly or hit a failed assert().
The missing parts of your backtrace are most likely in the Qt libraries, so try installing the debugging symbols for all of them.
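For reference, a minimal sketch of the two ways the answer mentions a process can end up raising signal 6 (nothing here is taken from the asker's code):

#include <cassert>
#include <cstdlib>

int main()
{
    // A failed assert() calls abort(), and abort() raises SIGABRT (signal 6)
    // on the calling thread, which is what shows up as __pthread_raise(sig=6)
    // in the backtrace above.
    assert(2 + 2 == 5);

    // An explicit abort() does the same thing directly.
    std::abort();
}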

SIGSEGV on program exit with boost::log

Some time ago we split our big project, which used almost only static libraries, into many projects with dynamic libraries.
Since then we started seeing problems on shutdown.
Sometimes the process would not terminate. With gdb I found that a segfault occurs during object destruction, but the process is blocked in futex_wait.
I've since improved the code: global objects are now created inside functions instead of as global static data (a minimal sketch of the pattern follows below). That reduced the problem: it no longer happens in my development environment.
However, in the test environment (rarely) and in the production environment (often), processes still get stuck on shutdown, so we have to restart the container manually or rely on some kind of health check.
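To make the refactor mentioned above concrete, here is a minimal sketch of the pattern; the names are invented, since the question does not show the actual code. The object is created on first use inside a function instead of as a namespace-scope global.

// Hypothetical logger type standing in for whatever global object the project uses.
struct AppLogger {
    // ...
};

// Before: a namespace-scope global, constructed and destroyed in an order that
// depends on which shared library gets loaded/unloaded when.
// static AppLogger g_logger;

// After: constructed on first use, so construction order follows the first caller.
// Destruction still happens during static destruction at exit, which is why this
// reduces, but does not eliminate, shutdown-order problems.
AppLogger& app_logger()
{
    static AppLogger instance;
    return instance;
}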
We are trying to simulate this kind of situation in a standalone Docker container running under Kubernetes, where the process runs under circusd, and we see the following:
#0 malloc_consolidate (av=0xf47fc400 <main_arena>) at malloc.c:4151
#1 0xf46ff1ab in _int_free (av=0xf47fc400 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:4057
#2 0xf48c6e68 in operator delete(void*) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#3 0xf52d173d in std::_Deque_base<boost::log::v2_mt_posix::record_view, std::allocator<boost::log::v2_mt_posix::record_view> >::~_Deque_base() () from /usr/local/lib/liblog.so.0
#4 0xf52d18b3 in std::deque<boost::log::v2_mt_posix::record_view, std::allocator<boost::log::v2_mt_posix::record_view> >::~deque() () from /usr/local/lib/liblog.so.0
#5 0xf52d1940 in boost::log::v2_mt_posix::sinks::bounded_fifo_queue<4000u, boost::log::v2_mt_posix::sinks::drop_on_overflow>::~bounded_fifo_queue() () from /usr/local/lib/liblog.so.0
#6 0xf52d462e in boost::log::v2_mt_posix::sinks::asynchronous_sink<cout_sink, boost::log::v2_mt_posix::sinks::bounded_fifo_queue<4000u, boost::log::v2_mt_posix::sinks::drop_on_overflow>
>::~asynchronous_sink() () from /usr/local/lib/liblog.so.0
#7 0xf52d47f4 in asynchronous_sink<cout_sink>::~asynchronous_sink() () from /usr/local/lib/liblog.so.0
#8 0xf52c199a in boost::detail::sp_counted_impl_pd<asynchronous_sink<cout_sink>*, boost::detail::sp_ms_deleter<asynchronous_sink<cout_sink> >
>::dispose() () from /usr/local/lib/liblog.so.0
#9 0xf51f3e7b in boost::log::v2_mt_posix::core::~core() () from /usr/lib/libboost_log.so.1.58.0
#10 0xf51f6529 in boost::detail::sp_counted_impl_p<boost::log::v2_mt_posix::core>::dispose() () from /usr/lib/libboost_log.so.1.58.0
#11 0xf51f6160 in boost::shared_ptr<boost::log::v2_mt_posix::core>::~shared_ptr() () from /usr/lib/libboost_log.so.1.58.0
#12 0xf46bcfb3 in __cxa_finalize (d=0xf526fa88) at cxa_finalize.c:56
#13 0xf51eaab3 in ?? () from /usr/lib/libboost_log.so.1.58.0
#14 0xf7769e2c in _dl_fini () at dl-fini.c:252
#15 0xf46bcc21 in __run_exit_handlers (status=status#entry=0, listp=0xf47fc3a4 <__exit_funcs>, run_list_atexit=run_list_atexit#entry=true) at exit.c:82
#16 0xf46bcc7d in __GI_exit (status=0) at exit.c:104
#17 0xf46a572b in __libc_start_main (main=0x8060dc0, argc=5, argv=0xffdd1514, init=0x8088090, fini=0x8088100, rtld_fini=0xf7769c50 <_dl_fini>, stack_end=0xffdd150c) at libc-start.c:321
#18 0x080630cc in ?? ()
I have no idea how to progress from here. What is happening? Why do we get the segfault during boost::log::core destruction in this environment?
Does anyone have advice, perhaps based on experience, on how I can track this down?

Core dump in libc exit call

I am seeing a core dump on Solaris during the exit procedure of my program. How can I debug and fix this kind of core dump?
(gdb) where
#0 0xff2cc0c0 in kill () from /usr/lib/libc.so.1
#1 0x0004dac0 in run_before_killed_handler (sig=11) at NdmpServer.cpp:1186
#2 <signal handler called>
#3 0xfee0ad50 in ?? ()
#4 0x00060a6c in proc_cleanup ()
#5 0xff2421ac in _exithandle () from /usr/lib/libc.so.1
#6 0xff2305d8 in exit () from /usr/lib/libc.so.1
#7 0x0003431c in _start ()
Your program apparently uses atexit(3C) to register an exit handler. The problem is occurring in that handler.
Without knowing the finer details of Solaris memory layouts, 0xfee0ad50 seems to be on the OS side. What OS call are you trying (and failing) to make in proc_cleanup?
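Purely for illustration (the question does not show the handler itself), here is a minimal sketch of the mechanism the answer describes, with a made-up handler name standing in for proc_cleanup(): a function registered with atexit() runs from inside exit(), so a crash in it shows up under _exithandle()/exit() in the backtrace.

#include <cstdlib>

// Hypothetical stand-in for proc_cleanup().
static void proc_cleanup_example()
{
    // If anything here touches a stale pointer or calls into a facility that
    // has already been torn down, the resulting crash appears "inside exit()",
    // because exit() is what invokes the atexit-registered handlers.
}

int main()
{
    std::atexit(proc_cleanup_example);   // handlers run in reverse registration order
    return 0;                            // returning from main() calls exit(), which runs the handler
}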