So my understanding of both pthread_exit and pthread_cancel is that they both cause an exception-like thing called a "forced unwind" to be thrown out of the relevant stack frame in the target thread. This can be caught in order to do thread-specific clean-up, but must be re-thrown or else we get an implicit abort() at the end of the catch block that didn't re-throw.
In the case of pthread_cancel, that happens either immediately on receipt of the associated signal, at the next entry into a cancellation point, or when the signal is next unblocked, depending on the thread's cancellation state and type.
In the case of pthread_exit, the calling thread immediately undergoes a forced unwind.
Fine. This "exception" is a normal part of the process of killing a thread. So why, even when I re-throw it, is it causing std::terminate() to be called, aborting my whole application?
Note that I'm catching and re-throwing the exception a couple of times.
Note also that I'm calling pthread_exit out of my SIGTERM signal handler. This works fine in my toy test code, compiled with g++ 4.3.2, which has a thread install signal(SIGTERM, handler_that_calls_pthread_exit) and then sit in a tight while loop until it receives the TERM signal (a sketch of that test follows). But it doesn't work in the real application.
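For reference, a minimal sketch of what such a toy test might look like (illustrative only; names and details are assumptions, not the actual code):

// Worker thread installs a SIGTERM handler that calls pthread_exit(),
// then spins until the signal arrives. Compile with -pthread.
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_term = 0;

extern "C" void handler_that_calls_pthread_exit(int)
{
    got_term = 1;
    pthread_exit(0);   // "works" in the toy test, but is not async-signal-safe
}

extern "C" void* thread_func(void*)
{
    signal(SIGTERM, handler_that_calls_pthread_exit);
    while (!got_term) { /* tight loop until TERM arrives */ }
    return 0;
}

int main()
{
    pthread_t t;
    pthread_create(&t, 0, thread_func, 0);
    sleep(1);
    pthread_kill(t, SIGTERM);  // deliver TERM to the worker thread
    pthread_join(t, 0);
    return 0;
}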
Relevant stack frames:
(gdb) where
#0 0x0000003425c30265 in raise () from /lib64/libc.so.6
#1 0x0000003425c31d10 in abort () from /lib64/libc.so.6
#2 0x00000000012b7740 in sv_bsd_terminate () at exception_handlers.cpp:38
#3 0x00002aef65983aa6 in __cxxabiv1::__terminate (handler=0x518)
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:43
#4 0x00002aef65983ad3 in std::terminate ()
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:53
#5 0x00002aef65983a5a in __cxxabiv1::__gxx_personality_v0 (
version=<value optimized out>, actions=<value optimized out>,
exception_class=<value optimized out>, ue_header=0x645bcd80,
context=0x645bb940)
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libstdc++-v3/libsupc++/eh_personality.cc:657
#6 0x00002aef6524d68c in _Unwind_ForcedUnwind_Phase2 (exc=0x645bcd80,
context=0x645bb940)
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libgcc/../gcc/unwind.inc:180
#7 0x00002aef6524d723 in _Unwind_ForcedUnwind (exc=0x645bcd80,
stop=<value optimized out>, stop_argument=0x645bc1a0)
at /view/ken_gcc_4.3/vobs/Compiler/gcc/libgcc/../gcc/unwind.inc:212
#8 0x000000342640cf80 in __pthread_unwind () from /lib64/libpthread.so.0
#9 0x00000034264077a5 in pthread_exit () from /lib64/libpthread.so.0
#10 0x0000000000f0d959 in threadHandleTerm (sig=<value optimized out>)
at osiThreadLauncherLinux.cpp:46
#11 <signal handler called>
Thanks!
Eric
Note also that I'm calling pthread_exit out of my SIGTERM signal handler.
This is your problem. To quote from the POSIX specs (http://pubs.opengroup.org/onlinepubs/009695399/functions/signal.html):
If the signal occurs other than as the result of calling abort(), raise(), kill(), pthread_kill(), or sigqueue(), the behavior is undefined if the signal handler refers to any object with static storage duration other than by assigning a value to an object declared as volatile sig_atomic_t, or if the signal handler calls any function in the standard library other than one of the functions listed in Signal Concepts.
The list of permitted functions is given at http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03, and does not include pthread_exit(). Therefore your program is exhibiting undefined behaviour.
I can think of three choices:
Set a flag in the signal handler which is checked by the thread periodically, rather than trying to exit directly from the signal handler (a sketch follows this list).
Use sigwait() to explicitly wait for the signal on an independent thread. This thread can then explicitly call pthread_cancel() on the thread you wish to exit.
Mask the signal, and call sigpending() periodically on the thread that is to be exited, and exit if the signal is pending.
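A minimal sketch of the first option, assuming the worker thread has a convenient place to poll the flag (names are illustrative):

// The handler only sets a volatile sig_atomic_t flag, which is one of the
// few things a signal handler may legally do; the thread polls the flag and
// exits from its own context. Compile with -pthread.
#include <pthread.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t term_requested = 0;

extern "C" void handle_term(int)
{
    term_requested = 1;            // only async-signal-safe work here
}

extern "C" void* worker(void*)
{
    while (!term_requested) {
        // ... do a bounded chunk of work, then re-check the flag ...
    }
    return 0;                      // returning from the thread function is a clean exit
}

int main()
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handle_term;
    sigaction(SIGTERM, &sa, 0);

    pthread_t t;
    pthread_create(&t, 0, worker, 0);
    pthread_join(t, 0);            // joins once SIGTERM has been delivered
    return 0;
}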
I'm using zmq version 4.2.2. My program crashes because of a call to zmq_abort(), which calls abort(). According to the stack trace, if I understand it correctly, zmq_abort() is called from src/socket_poller.cpp:54. However, that line is the beginning of the function definition:
zmq::socket_poller_t::~socket_poller_t ()
The function contains no direct calls to zmq_abort() and no assert macros that would call it. There are not many asserts, or any direct calls to zmq_abort(), in the rest of the file either. However, the other lines in the stack trace do match the source code on GitHub:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/socket_poller.cpp#L54
How does execution end up in zmq_abort()?
Beginning of stack trace:
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f12cc9d4700 (LWP 23680))]
(gdb) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f12ce123415 in __GI_abort () at abort.c:90
#2 0x00007f12ce8db9c9 in zmq::zmq_abort (errmsg_=<optimized out>) at ../zeromq-4.2.2/src/err.cpp:87
#3 0x00007f12ce918cbe in zmq::socket_poller_t::~socket_poller_t (this=0x7f12c8004150,
__in_chrg=<optimized out>) at ../zeromq-4.2.2/src/socket_poller.cpp:54
#4 0x00007f12ce91793a in zmq_poller_destroy (poller_p_=0x7f12cc9d2af8)
at ../zeromq-4.2.2/src/zmq.cpp:1236
#5 0x00007f12ce917e14 in zmq_poller_poll (timeout_=<optimized out>, nitems_=2, items_=0x1)
at ../zeromq-4.2.2/src/zmq.cpp:854
#6 zmq_poll (items_=items_@entry=0x7f12cc9d2c20, nitems_=nitems_@entry=2, timeout_=timeout_@entry=5000)
at ../zeromq-4.2.2/src/zmq.cpp:866
zmq_abort() was called from an assertion macro in signaler_t's destructor:
https://github.com/zeromq/libzmq/blob/v4.2.2/src/signaler.cpp#L143. The signaler_t object is a member of socket_poller_t. I don't know for sure why the destructor call is not shown in the stack trace; presumably it was inlined into ~socket_poller_t.
I was trying not to ask (directly) what was wrong with my code, because it was infeasible to provide a code sample, but I'll mention that the cause turned out to be a file descriptor that was erroneously closed twice in another thread. Between the two close operations, zmq_poll() created a socket_poller_t object; signaler_t's constructor opened an eventfd, which was assigned the same fd number that had been closed earlier. The other thread then closed that fd again, so the destructor got EBADF from close() and called zmq_abort().
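To illustrate the failure mode (not the actual application code, which involved two threads): once an fd number has been closed, the kernel is free to hand the same number to the next open; a stray second close() of the old number then silently closes the new descriptor, and a later close() of it fails with EBADF.

// Single-threaded illustration of the double-close / fd-reuse problem.
#include <cerrno>
#include <cstdio>
#include <sys/eventfd.h>
#include <unistd.h>

int main()
{
    int fd = eventfd(0, 0);
    close(fd);                 // first (legitimate) close

    int efd = eventfd(0, 0);   // stands in for signaler_t's eventfd;
                               // very likely reuses the same fd number
    close(fd);                 // erroneous second close: if the number was
                               // reused, this silently closes efd

    if (close(efd) == -1 && errno == EBADF)
        std::printf("close(efd) failed with EBADF, as in the zmq assertion\n");
    return 0;
}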
I am supporting an application that has been written in C++ over many years, and lately it has started to crash, producing core dumps that we don't know how to interpret.
It runs on an appliance on Ubuntu 14.04.5
When loading the core file in GDB it says that:
Program terminated with signal SIGABRT, Aborted
I can inspect 230 threads but they are all in wait() in the exact same memory position.
There is a thread with ID 1 that in theory could be the culprit, but that thread is also in wait().
So I have two questions basically.
How does the id index of the threads work?
Is the thread with GDB ID 1 the last active thread, or is that an arbitrary index, so the failure could be in any of the other threads?
How can all threads be in wait() when a SIGABRT is triggered?
Shouldn't the instruction pointer be at the failing instruction when the OS decided to step in and halt the process? Or is this some sort of deadlock protection?
Any help much appreciated.
Backtrace of thread 1:
#0 0xf771dcd9 in ?? ()
#1 0xf74ad4ca in _int_free (av=0x38663364, p=<optimized out>,have_lock=-186161432) at malloc.c:3989
#2 0xf76b41ab in std::string::_Rep::_M_destroy(std::allocator<char> const&) () from /usr/lib32/libstdc++.so.6
#3 0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#4 0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#5 0x5685e8b4 in SlimStringMapper::~SlimStringMapper() ()
#6 0x567d6bc3 in destroy ()
#7 0x566a40b4 in HttpProxy::getLogonCredentials(HttpClient*, HttpServerTransaction*, std::string const&, std::string const&, std::string&, std::string&) ()
#8 0x566a5d04 in HttpProxy::add_authorization_header(HttpClient*, HttpServerTransaction*, Hosts::Host*) ()
#9 0x566af97c in HttpProxy::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#10 0x566d597e in callOnClientRequest(HttpClient*, HttpServerTransaction*, FastHttpRequest*) ()
#11 0x566d169f in GateKeeper::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#12 0x566a2291 in HttpClientThread::run() ()
#13 0x5682e37c in wa_run_thread ()
#14 0xf76f6f72 in start_thread (arg=0xec65ab40) at pthread_create.c:312
#15 0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#16 0xec65ab40 in ?? ()
Another thread that should be in wait:
#0 0xf771dcd9 in ?? ()
#1 0x5682e37c in wa_run_thread ()
#2 0xf76f6f72 in start_thread (arg=0xf33bdb40) at pthread_create.c:312
#3 0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#4 0xf33bdb40 in ?? ()
Best regards
Jon
How can all threads be in wait() when a SIGABRT is triggered?
Is wait the POSIX function, or something from the run-time environment? Are you looking at a higher-level backtrace?
Anyway, there is an easy explanation why this can happen: SIGABRT was sent to the process, and not generated by a thread in a synchronous fashion. Perhaps a coworker sent the signal to create the coredump, after observing the deadlock, to collect evidence for future analysis?
How does the id index of the threads work? Is thread with GDB ID 1 the last active thread?
When the program is running under GDB, GDB numbers threads as it discovers them, so thread 1 is always the main thread.
But when loading a core dump, GDB discovers threads in the order in which the kernel saved them. The kernels I have seen always save the thread that caused program termination first, so loading a core into GDB usually puts you at the crash point immediately, without any need to switch threads.
How can all threads be in wait() when a SIGABRT is triggered?
One possibility is that you are not analyzing the core correctly. In particular, you need exact copies of the shared libraries that were in use at the time the core was produced, and that's unlikely to be the case when the application runs on an appliance and you are analysing the core on your development machine. See this answer.
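If the core was produced on the appliance, pointing GDB at copies of the appliance's libraries before loading the core usually repairs the nonsense frames. A sketch of such a session (paths are illustrative):

# Analysing a core from another machine: use the target's libraries,
# not the development machine's.
gdb /path/to/appliance/binary
(gdb) set sysroot /path/to/copy-of-appliance-root
(gdb) set solib-search-path /path/to/copy-of-appliance-root/usr/lib32
(gdb) core-file /path/to/core
(gdb) thread apply all bt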
I just saw your question. My answer is not specific to your direct question, but suggests some ways to handle this kind of situation. Multi-threading depends heavily on the hardware and operating system of the machine, especially memory and processors. More threads mean more memory and more processor time slices. I doubt your machine has the 100+ processors it would need for 230 threads to run concurrently at full performance. To avoid this situation, the following steps may help.
Control the creation of threads and limit the number of threads running concurrently.
Increase the memory available to your application (check the compiler options for increasing memory at run time, or configure the OS to allocate enough memory).
Set the guard size and stack size of each thread properly (the calculation depends on what your application's threads do and is somewhat involved; please read the relevant documentation).
Handle synchronized block properly to avoid any deadlock.
Where necessary, use condition variables, conditional locking, and the like.
Since you say most of your threads are in a wait state, they are waiting for a lock to be released; that means one thread has already acquired the lock and is either still busy processing or is probably deadlocked.
I have an exit handler thread waiting on a condition for the worker thread to do its work. The signalling is done from the worker thread's destructor.
Below is the code of the exit handler thread.
void Class::TaskExitHandler::run() throw()
{
    while( ! isInterrupted() ) {
        _book->_eot_cond.wait(); // Waiting on this condition
        {
            CLASS_NAMESPACE::Guard<CLASS_NAMESPACE::FastLock> eguard(_book->_exitlist_lock);
            list<TaskGroupExecutor*>::const_iterator itr = _book->_exited_tasks.begin();
            for( ; itr != _book->_exited_tasks.end(); itr++ ) {
                (*itr)->join();
                TRACER(TRC_DEBUG) << "Deleting exited task:" << (*itr)->getLoc() << ":"
                    << (*itr)->getTestID() << ":" << (*itr)->getReportName() << endl;
                delete (*itr);
            }
            _book->_exited_tasks.clear();
        }
        _book->executeAny();
    }
}
Now, what has been observed is that when the worker thread catches any exception (raised from a lower layer), this thread (the exit handler) resumes and immediately cores with exit code 134, which is SIGABRT.
The stacktrace is as follows-
#0 0x0000005555f49b4c in raise () from /lib64/libc.so.6
#1 0x0000005555f4b568 in abort () from /lib64/libc.so.6
#2 0x0000005555d848b4 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3 0x0000005555d82210 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x0000005555d82258 in std::terminate () from /usr/lib64/libstdc++.so.6
#5 0x0000005555d82278 in ?? () from /usr/lib64/libstdc++.so.6
#6 0x0000005555d81b18 in __cxa_call_unexpected () from /usr/lib64/libstdc++.so.6
#7 0x0000000120047898 in Class::TaskExitHandler::run ()
#8 0x000000012001cd38 in commutil::ThreadBase::thread_proxy ()
#9 0x0000005555c6e438 in start_thread () from /lib64/libpthread.so.0
#10 0x0000005555feed6c in __thread_start () from /lib64/libc.so.6
Backtrace stopped: frame did not save the PC
So it seems that this run() function, which declares with its "throw()" specification that it will not throw any exceptions, raises an exception (from frame 4). As various references about __cxa_call_unexpected() explain, the stack trace shows the compiler's typical behaviour of aborting when an exception is raised in a function with a "throw()" specification.
Am I right with the analysis of the problem?
To test this, I added a try/catch in this method and printed the exception message. Now the process didn't core. The exception message was the same as the one caught by the worker thread.
My question is: how does this thread get access to the exception caught by the other? Do they share some data structure related to exception handling?
Please throw some light on this. It is quite puzzling.
Note: as per the stack trace, the call to __cxa_call_unexpected happens immediately after run() is entered. That strengthens my suspicion that the exception stack or data is somehow shared, but I haven't found any references to such behaviour.
I shall answer my own question.
What happened in this case was that a destructor was being invoked in the TaskExitHandler thread, and this destructor performed the same operation that had caused the exception in the main thread.
As the TaskExitHandler thread was designed (or rather, expected) not to throw, there were no try-catch blocks, and hence the process aborted when the exception was raised.
As the destructor call was implicit, it never appeared in the stack trace, which made it very difficult to find. Each object had to be tracked down to locate this exception leak.
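A minimal sketch of the mechanism (illustrative names, not the original code): a destructor that throws inside a function declared throw() drives execution through unexpected()/std::terminate(), and no throw statement is visible in the aborting frame.

// Sketch: an implicit destructor call aborting a throw() function.
// Compile as C++98/C++03; since C++11 destructors are noexcept by default
// and the program would terminate even earlier, at the throw itself.
#include <stdexcept>

struct ReportHandle {
    ~ReportHandle() {
        // stands in for the cleanup that rediscovers the lower-layer error
        throw std::runtime_error("flush failed");
    }
};

void exit_handler_run() throw()   // dynamic exception spec, like run()
{
    ReportHandle h;               // destructor runs at the end of this scope
}                                 // exception escapes throw():
                                  // unexpected() -> terminate() -> abort()

int main()
{
    exit_handler_run();           // aborts; no explicit throw in this frame
    return 0;
}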
Thanks everyone for the active participation :) This was my first question to get some active responses.
I'll take a stab - hopefully this will give you enough to continue your research.
I suspect the thread running TaskExitHandler is the parent thread for all of the worker threads. TEH would have a hard(er) time joining up with the children otherwise.
The child / worker threads are not handling the exceptions thrown to them. However, an exception must be handled somewhere or the entire process will get shut down. The parent thread (aka TEH) is the last stop in the process's stack / chain for handling exceptions. Your sample code shows that TEH's exception handling is to simply throw / not handle the exception. So it cores out.
It's not necessarily a data structure that's being shared, but rather the process / thread IDs and memory space. The child threads do share global memory / heap space with the parent and each other, hence the need for semaphores and / or mutexes for locking purposes.
Good encapsulation dictates that the worker threads should be smart enough to handle any / all exceptions they might see. That way, the individual worker thread can be killed off instead of bringing down the parent thread and the rest of the process tree. OTW, you can continue catching the exception(s) in TEH, but it's really unlikely that thread has (or should have) the knowledge of what to do with the exception.
Add a comment if the above isn't clear, I'm happy to explain further.
I did a little research and confirmed that exception objects are allocated on the heap, not the stack. All the threads of your process share the same heap*, so it makes more sense (at least to me) why the parent thread would see the exception when the child thread doesn't catch it. *FWIW, if you fork your process instead of starting a new thread, you'll get a new heap as well. However, forking is a more expensive operation memory-wise, since the heap contents are copied over to the new process as well.
This SO thread discusses setting up a thread to catch all exceptions, which will probably be of interest:
catching exceptions from another thread
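For reference, a sketch of the usual explicit way to hand an exception from a worker thread to the thread that joins it (C++11 std::exception_ptr; this is an assumption about what the linked answer covers, and it is separate from the implicit-destructor problem above):

// Sketch: capturing an exception in a worker thread and re-raising it in
// the joining thread via std::exception_ptr (C++11).
#include <exception>
#include <iostream>
#include <stdexcept>
#include <thread>

int main()
{
    std::exception_ptr eptr;

    std::thread worker([&eptr] {
        try {
            throw std::runtime_error("worker failed");
        } catch (...) {
            eptr = std::current_exception();   // capture for the other thread
        }
    });
    worker.join();

    if (eptr) {
        try {
            std::rethrow_exception(eptr);      // re-raise in this thread
        } catch (const std::exception& e) {
            std::cout << "caught from worker: " << e.what() << "\n";
        }
    }
    return 0;
}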
I have an ip::udp::socket constructed with an io_service. There is only one boost::thread which calls the io_service::run() method, and an instance of io_service::work to prevent io_service::run() from returning. The completion handlers for my ip::udp::socket have custom asio_handler_allocate() and asio_handler_deallocate() functions, which are backed by a my::custom_memory_pool.
When my application quits, this sequence of events occurs on my shutting-down thread:
1. ip::udp::socket::close()
2. work::~work()
3. io_service::stop()
4. thread::join()
5. my::custom_memory_pool::~custom_memory_pool()
6. ip::udp::socket::~socket()
7. thread::~thread()
8. io_service::~io_service()
In step 8, the call to io_service::~io_service() causes...
Program terminated with signal 11, Segmentation fault.
#0 0x00000000005ad93c in my::custom_memory_pool<boost::aligned_storage<512u, -1u> >::deallocate (this=0x36323f8, t=0x7fca97a07880)
at memory.hpp:82
82 reinterpret_cast<pool_node*>(t)->next_ = head_;
(gdb) bt 30
#0 0x00000000005ad93c in my::custom_memory_pool<boost::aligned_storage<512u, -1u> >::deallocate (this=0x36323f8, t=0x7fca97a07880)
at memory.hpp:82
#1 0x00000000005ad40a in asio_handler_deallocate (p=0x7fca97a07880, s=96, h=0x7fffe09d5480) at net.cpp:22
#2 0x0000000000571a07 in boost_asio_handler_alloc_helpers::deallocate<socket_multicast::completion_handler> (p=0x7fca97a07880, s=96, h=...)
at /usr/include/boost/asio/detail/handler_alloc_helpers.hpp:51
#3 0x0000000000558256 in boost::asio::detail::reactive_socket_recvfrom_op<boost::asio::mutable_buffers_1, boost::asio::ip::basic_endpoint<boost::asio::ip::udp>, socket_multicast::completion_handler>::ptr::reset (this=0x7fffe09d54b0)
at /usr/include/boost/asio/detail/reactive_socket_recvfrom_op.hpp:81
#4 0x0000000000558310 in boost::asio::detail::reactive_socket_recvfrom_op<boost::asio::mutable_buffers_1, boost::asio::ip::basic_endpoint<boost::asio::ip::udp>, socket_multicast::completion_handler>::do_complete (owner=0x0, base=0x7fca97a07880)
at /usr/include/boost/asio/detail/reactive_socket_recvfrom_op.hpp:112
#5 0x0000000000426706 in boost::asio::detail::task_io_service_operation::destroy (this=0x7fca97a07880)
at /usr/include/boost/asio/detail/task_io_service_operation.hpp:41
#6 0x000000000042841b in boost::asio::detail::task_io_service::shutdown_service (this=0xd4df30)
at /usr/include/boost/asio/detail/impl/task_io_service.ipp:96
#7 0x0000000000426388 in boost::asio::detail::service_registry::~service_registry (this=0xd4a320, __in_chrg=<value optimized out>)
at /usr/include/boost/asio/detail/impl/service_registry.ipp:43
#8 0x0000000000428e99 in boost::asio::io_service::~io_service (this=0xd49f38, __in_chrg=<value optimized out>)
at /usr/include/boost/asio/impl/io_service.ipp:51
So io_service::~io_service() is trying to deallocate memory back to the pool that I destroyed in step 5.
I can't move my::custom_memory_pool::~custom_memory_pool() to after io_service::~io_service().
I expected that after io_service::stop() and thread::join() return, there could be no more asio_handler_deallocate() calls. Apparently that's not the case. What can I do at step 3 to force the io_service to dequeue all of its completion events and deallocate all of its handler memory, and how can I block until the io_service finishes those tasks?
Here is the answer: When tearing down an io_service and its services, don't call io_service::stop() at all. Just work::~work().
io_service::stop() is really only for temporarily suspending the io_service so that it may be io_service::reset() later. An ordinary graceful shutdown of io_service should not involve io_service::stop().
Calling io_service::stop() doesn't allow it to finish any handlers, it simply stops processing after any currently running handler and returns as soon as possible.
Since the various internal structures are destroyed when the io_service is destroyed, one solution is to control the order in which the io_service is destroyed relative to the custom allocator: either order them correctly if they're members of the same structure (declare the allocator before the io_service), or allocate them on the heap and explicitly order their destruction to guarantee that the io_service is destroyed first (a sketch follows).
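A sketch of that ordering, combined with the work-based shutdown from the accepted answer (the pool type and class layout are illustrative assumptions, using the io_service/work API of that era):

// Members are destroyed in reverse order of declaration, so declaring the
// pool before the io_service guarantees the io_service (and any handler
// memory it still holds) is torn down first.
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/scoped_ptr.hpp>
#include <boost/thread.hpp>

struct custom_memory_pool { /* allocate()/deallocate() as in the question */ };

static void run_io(boost::asio::io_service* io) { io->run(); }

class Engine {
public:
    Engine()
        : work_(new boost::asio::io_service::work(io_)),
          runner_(boost::bind(&run_io, &io_)) {}

    ~Engine()
    {
        work_.reset();   // no io_service::stop(): let run() drain and return
        runner_.join();  // run() has returned; abandoned handlers (if any)
                         // are destroyed later, in ~io_service, which is why
                         // pool_ must still exist at that point
    }

private:
    custom_memory_pool pool_;     // declared first => destroyed last
    boost::asio::io_service io_;  // destroyed before pool_
    boost::scoped_ptr<boost::asio::io_service::work> work_;
    boost::thread runner_;
};

int main()
{
    Engine e;  // ~Engine performs the graceful shutdown described above
    return 0;
}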
I have a program that brings up and tears down multiple threads throughout its life. Everything works great for a while, but eventually I get the following core dump stack trace.
#0 0x009887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x007617a5 in raise () from /lib/tls/libc.so.6
#2 0x00763209 in abort () from /lib/tls/libc.so.6
#3 0x003ec1bb in __gnu_cxx::__verbose_terminate_handler () from /usr/lib/libstdc++.so.6
#4 0x003e9ed1 in __cxa_call_unexpected () from /usr/lib/libstdc++.so.6
#5 0x003e9f06 in std::terminate () from /usr/lib/libstdc++.so.6
#6 0x003ea04f in __cxa_throw () from /usr/lib/libstdc++.so.6
#7 0x00d5562b in boost::thread::start_thread () from /h/Program/bin/../lib/libboost_thread-gcc34-mt-1_39.so.1.39.0
At first I was leaking threads, and figured the core was due to hitting some maximum limit on the number of concurrent threads, but now it seems the problem occurs even when I don't leak them. For reference, in the core above there were 13 active threads executing.
I did some searching to try to figure out why start_thread would core, but I didn't come across anything. Anyone have any ideas?
start_thread is throwing an uncaught exception; see which exceptions start_thread can throw and place a catch around the thread creation to see what the problem is.
What are the values carried by thread_resource_error? It looks like you can call native_error() to find out.
Since this is a wrapper around pthreads there are only a couple of possibilities - EAGAIN, EINVAL and EPERM. It looks as if boost has exceptions it would likely throw for EINVAL and EPERM - i.e. unsupported_thread_option() and thread_permission_error().
That pretty much leaves EAGAIN, so I would double-check that you really aren't exceeding the system limit on the number of threads. Are you sure you are joining them, or, if detached, that they are really gone?
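A sketch of the suggested catch (whether native_error() is available depends on the Boost version; what() is always there since boost::thread_resource_error ultimately derives from std::exception):

// Wrap thread creation to see which resource error is reported.
#include <boost/thread.hpp>
#include <iostream>

static void worker() { /* thread body */ }

int main()
{
    try {
        boost::thread t(&worker);
        t.join();
    } catch (const boost::thread_resource_error& e) {
        // With pthreads underneath, EAGAIN here usually means the process
        // has hit a thread-count or memory limit.
        std::cerr << "thread creation failed: " << e.what() << std::endl;
        // e.native_error() may also be usable, depending on the Boost version.
    }
    return 0;
}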