I have a multithreaded application that runs 30-something threads. I know there is a bug where sometimes two threads attempt to sort one list simultaneously and this usually results in one of the threads accessing invalid memory. Thus, a SIGSEGV is generated for that thread.
Now, from what I understand about signals, the default action (SIG_DFL) should be taken for SIGSEGV, which is abnormal termination of the process and a core dump. However, the process stayed in a kind of limbo: execution had halted, but the process was still alive. When I tried to kill it, the SIGTERM was actually delivered to my custom signal handler (which attempts to shut down all the threads nicely), but the handler hung, because none of the threads were actually executing anymore. I finally managed to kill it using SIGQUIT, and the core file was generated after that.
So my question is: what is meant by "abnormal termination"? How can a process not be removed from memory if the SIG_DFL action is taken for SIGSEGV? What could possibly be going on that caused such behaviour? My Linux is Red Hat Enterprise Linux Server release 5.11 (Tikanga).
EDIT: I know (more or less) how to debug it and I even know what the bug is. My question is more or less: what exactly does Red Hat do when the SIG_DFL action is taken for SIGSEGV? The problem here is that the process was not responding, but was not dead either, so the automatic restart procedure did not kick in and we had some unpleasant downtime.
The situation was not as straightforward as I thought. The original problem was a deadlock between two of the threads. When I issued SIGTERM, my custom signal handler actually caused a segfault in the (now un-)deadlocked threads.
I'm building plugins for a host application using C++11/14, for now targeting Windows and MacOS. The plugins start up async worker threads when the host app starts us up, and if they're still running when the host shuts the plugins down they get signaled to stop. Some of these worker threads are started with std::async so I can use an std::future to get the thread result back, while other less involved threads are just std::threads which I ultimately just join to see when they're done. It all works nicely this way.
Unless the host decides not to call our shutdown procedure when it shuts down itself... Yeah, I know, but it really is that bad sometimes -- it often enough just crashes during shutdown. And they even plan to make that into a 'feature' and call it "Fast Exit" to please their users; just pull the plug and we're done extra fast :(
For that case I have registered an std::atexit handler. It last-minute signals any still running threads to exit NOW (atomic bools and/or signals to wake them up), then it waits a second to give the threads some time to respond, and finally it detaches the regular std::thread threads and hopes for the best. This way at least the threads get a heads up to quickly write intermediate state to disk for a next round (if needed), and quit writing to probably already deceased data structures, thus avoiding crashes which would make any crash dump point the finger at my plugins.
However, atexit handlers run at OS DLL unload time, so I'm not even allowed to use thread synchronization (right?). And under the debugger I just saw all of the worker threads were presumably already killed by the OS, since the atexit handler's thread was the only thread left under the debugger. Needless to say, all remaining std::futures went into full blocking mode, hanging up the remaining corpse of the dead host app...
Is there a way to abandon an std::future? In MS Visual C++ I saw futures have an _Abandon method, but that's too platform specific (and undocumented) for my taste. Or is my only recourse to not use std::future, do all thread communication via my own data structures and synchronization, and work with simple std::threads which can just be detached?
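One possible middle ground, sketched here under assumptions rather than taken from the post: only futures returned by std::async block in their destructor, so a future obtained from a std::packaged_task that runs on a detached std::thread can simply be dropped. The helper names `spawn_detached` and `try_collect` are hypothetical. Waiting with a deadline instead of calling get() unconditionally avoids hanging if the worker never delivers a result.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper (not from the original post): runs `fn` on a detached
// thread and returns a future for its result. Because the future comes from
// a packaged_task rather than std::async, destroying it does not block.
template <typename Fn>
auto spawn_detached(Fn fn) -> std::future<decltype(fn())> {
    using Result = decltype(fn());
    std::packaged_task<Result()> task(std::move(fn));
    std::future<Result> result = task.get_future();
    std::thread(std::move(task)).detach();
    return result;
}

// Hypothetical helper: waits at most `deadline` for the result. Returns false
// (leaving `out` untouched) if the worker has not finished by then, so the
// caller can abandon the future instead of blocking forever.
template <typename T>
bool try_collect(std::future<T>& fut, std::chrono::milliseconds deadline, T& out) {
    if (fut.wait_for(deadline) != std::future_status::ready) {
        return false;
    }
    out = fut.get();
    return true;
}
```

For example, `auto fut = spawn_detached([] { return 6 * 7; });` followed by `try_collect(fut, std::chrono::milliseconds(1000), value)` yields the result if the worker finished in time. This does not make it safe for a detached thread to touch data that is being torn down; it only removes the blocking destructor.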
I am working on a project where we have used pthread_create to create several child threads.
The thread creation logic is not in my control, as it's implemented by another part of the project.
Each thread performs an operation that takes more than 30 seconds to complete.
Under normal conditions the program works perfectly fine.
But the problem occurs at the time of termination of the program.
I need to exit from main as quickly as possible when I receive the SIGINT signal.
When I call exit() or return from main, the exit handlers and global objects' destructors are called. I believe these operations race with the threads that are still running, and I believe there are many such race conditions, which makes it hard to fix all of them.
The way I see it there are two solutions.
1. Call _exit() and forget all de-allocation of resources.
2. When SIGINT arrives, close/kill all threads and then call exit() from the main thread, which will release resources.
I think the 1st option will work, but I do not want to abruptly terminate the process.
So I want to know if it is possible to terminate all child threads as quickly as possible so that exit handler & destructor can perform required clean-up task and terminate the program.
I have gone through this post, let me know if you know other ways: POSIX API call to list all the pthreads running in a process
Also, let me know if there is any other solution to this problem
What is it that you need to do before the program quits? If the answer is 'deallocate resources', then you don't need to worry. If you call _exit then the program will exit immediately and the OS will clean up everything for you.
Be aware also that what you can safely do in a signal handler is extremely limited, so attempting to perform any cleanup yourself is not recommended. If you're interested, there's a list of what you can do here. But you can't safely flush buffered stdio output with fflush, for example (which is about the only thing I can think of that you might legitimately want to do here). That's off limits.
I need to exit from main as quickly as possible when I receive the SIGINT signal.
How is that defined? Because there's no way to "exit quickly as possible" when you receive one signal like that.
You can either set flag(s), post to semaphore(s), or similar to set a state that tells other threads it's time to shut down, or you can kill the entire process.
If you elect to set flag(s) or similar to tell the other threads to shut down, you set those flags and return from your signal handler and hope the threads behave and the process shuts down cleanly.
If you elect to kill threads, there's effectively no difference in killing a thread, killing the process, or calling _exit(). You might as well just keep it simple and call _exit().
That's all you can choose between when you have to make your decision in a single signal handler call. Pick one.
A better solution is to use escalating signals. For example, when you get SIGQUIT or SIGINT, you set flag(s) or otherwise tell threads it's time to clean up and exit the process - or else. Then, say five seconds later whatever is shutting down your process sends SIGTERM and the "or else" happens. When you get SIGTERM, your signal handler simply calls _exit() - those threads had their chance and they messed it up and that's their fault. Or you can call abort() to generate a core file and maybe provide enough evidence to fix the miscreant threads that won't shut down.
And finally, five seconds later the managing process will nuke the process from orbit with SIGKILL just to be sure.
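The escalation described above could be sketched roughly as follows. This is a minimal illustration for POSIX systems, not the answerer's code: the flag `g_shutdown_requested` and the function names are hypothetical, and the flag is assumed to be a lock-free atomic so that storing to it from a handler is safe. Worker threads would have to poll the flag in their own loops.

```cpp
#include <atomic>
#include <signal.h>
#include <unistd.h>   // _exit (POSIX)

// Hypothetical flag that worker threads are assumed to poll.
std::atomic<bool> g_shutdown_requested{false};

// First stage (SIGINT/SIGQUIT): only record the request and return.
extern "C" void on_polite_request(int) {
    g_shutdown_requested.store(true);
}

// Second stage (SIGTERM): the threads had their chance, leave immediately.
extern "C" void on_final_warning(int) {
    _exit(1);   // async-signal-safe; skips atexit handlers and destructors
}

// Installs both stages. Returns true if every sigaction call succeeded.
bool install_escalating_handlers() {
    struct sigaction sa {};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;

    sa.sa_handler = on_polite_request;
    if (sigaction(SIGINT, &sa, nullptr) != 0) return false;
    if (sigaction(SIGQUIT, &sa, nullptr) != 0) return false;

    sa.sa_handler = on_final_warning;
    return sigaction(SIGTERM, &sa, nullptr) == 0;
}
```

Replacing `_exit(1)` with `abort()` in the second stage would give the core file mentioned above. SIGKILL needs no handler, since it cannot be caught.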
I have a daemon utility that I need to keep running without crashing. I know I can register for signals and ignore all of them except SIGKILL, and I did that in my application too.
My daemon is multithreaded, and I want to know: if a SIGABRT is raised by some code in a thread, will that thread exit? Or, if I ignore SIGABRT, will that thread continue running?
Let's say my app last crashed because of this error:
*** error for object 0x101800068: incorrect checksum for freed object - object was probably modified after being freed.
Can I keep my thread running if it doesn't exit, and would that create any issues?
I want my application to keep running no matter what. I want my application to recover from the error, like a process restart. It would be better if I could exit all threads except main() when the crash signal arrives and then restart them. But as far as I have noticed, the threads do not exit when these signals arrive. How can I get all my threads to exit on these signals, so that I can restart them?
[too long for a comment]
There are conditions where a thread is forced to go down and if one thread goes down the whole program goes down. That's it.
This is different for processes.
So one approach to building a more robust multi-tasking system would be to use processes instead of threads, with each process supervised and restarted on crash by another process. The latter of course could also be supervised and restarted on crash, this in turn then could also be ...
Ok, perhaps it might be more efficient to generate/compose code that does not crash.
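For what it's worth, the supervising process mentioned above can be quite small. The following is a rough POSIX sketch under assumptions, not a production design: `supervise`, `healthy_worker` and `crashing_worker` are hypothetical names, and the supervisor is assumed to fork before it starts any threads of its own.

```cpp
#include <cerrno>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical example workers: one exits cleanly, one dies from SIGABRT.
int healthy_worker() { return 0; }
int crashing_worker() { std::abort(); }

// Runs `worker` in a child process and restarts it each time it is killed by
// a signal, at most `max_restarts` times. Returns the number of restarts
// performed, or -1 if fork() or waitpid() failed.
int supervise(int (*worker)(), int max_restarts) {
    int restarts = 0;
    for (;;) {
        pid_t pid = fork();
        if (pid < 0) return -1;
        if (pid == 0) {
            _exit(worker());   // child: run the worker and never return
        }

        int status = 0;
        while (waitpid(pid, &status, 0) < 0) {
            if (errno != EINTR) return -1;   // retry only if interrupted
        }

        // A normal exit (any exit code) ends supervision; a death by signal
        // triggers a restart until the budget is used up.
        if (!WIFSIGNALED(status) || restarts == max_restarts) {
            return restarts;
        }
        ++restarts;
    }
}
```

With these workers, `supervise(healthy_worker, 3)` performs no restarts, while `supervise(crashing_worker, 2)` restarts the child twice and then gives up. A real supervisor would probably also add a delay between restarts to avoid spinning on a worker that crashes immediately.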
Is it possible for a C++ program to monitor which processes get killed (either by the user or by the OS), or terminate for some reason other than a segmentation fault or illegal operation, and perform some actions afterwards?
Short answer, yes it's possible.
Long answer:
You will need to implement signal handlers for the different signals that may kill a process. You can't necessarily catch EVERY type of signal (in particular, SIGKILL is not possible to catch since that would potentially make a process unkillable).
Use the sigaction function call to set up your signal handlers.
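A minimal sketch of such a setup, assuming a POSIX system; the names `watch_signal`, `record_signal`, `g_last_signal` and `g_sender_pid` are made up for illustration. It uses SA_SIGINFO so the handler also receives a siginfo_t, and it stores the values in atomics that are assumed to be lock-free.

```cpp
#include <atomic>
#include <signal.h>

// Hypothetical storage for what the handler observed.
std::atomic<int> g_last_signal{0};
std::atomic<int> g_sender_pid{0};

// Three-argument handler form, selected by SA_SIGINFO.
extern "C" void record_signal(int signo, siginfo_t* info, void*) {
    g_last_signal.store(signo);
    // si_pid is only meaningful for signals sent by a process (e.g. kill()).
    g_sender_pid.store(info ? static_cast<int>(info->si_pid) : 0);
}

// Installs record_signal for `signo`. Returns false if the kernel refuses,
// which is what happens for SIGKILL and SIGSTOP.
bool watch_signal(int signo) {
    struct sigaction sa {};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = record_signal;
    return sigaction(signo, &sa, nullptr) == 0;
}
```

Calling `watch_signal(SIGUSR1)` and then `raise(SIGUSR1)` leaves SIGUSR1 in `g_last_signal`, whereas `watch_signal(SIGKILL)` returns false, which matches the point above that SIGKILL cannot be caught.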
There is a decent list of which signals do what here (about 1/3 down from the top):
http://pubs.opengroup.org/onlinepubs/7908799/xsh/signal.h.html
Edit: Sorry, I thought you meant within the process, not from outside of the process. If you "own" the process, you can use ptrace and its PTRACE_GETSIGINFO request to find out what the signal was.
To generally "find processes killed" would be quite difficult - or at least to tell the difference between processes just exiting on their own, as opposed to those that exit because they are killed for some other reason.
In multithreaded programming, what if one of the worker threads exits unexpectedly and the main thread needs to know whether that thread is still alive?
Is there any way to check this?
I was wondering if there is a typical signal that is raised when a worker thread exits.
(Linux)
Thank you
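As far as I know, no signal is raised when a thread simply returns from its start function, so one common workaround is to have each worker publish its own state. Below is a minimal sketch of that idea using standard C++ threads; the class name `MonitoredThread` is hypothetical. It only detects a worker whose function has returned; it cannot recover from a thread that crashed, because a crash takes the whole process down.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical wrapper: a worker thread plus a flag the main thread can poll.
class MonitoredThread {
public:
    explicit MonitoredThread(std::function<void()> body)
        : done_(false),
          thread_([this, body] {
              body();              // run the user's work
              done_.store(true);   // reached only when body() returns
          }) {}

    ~MonitoredThread() {
        if (thread_.joinable()) thread_.join();
    }

    MonitoredThread(const MonitoredThread&) = delete;
    MonitoredThread& operator=(const MonitoredThread&) = delete;

    // True while the worker function has not returned yet.
    bool alive() const { return !done_.load(); }

    void join() { thread_.join(); }

private:
    std::atomic<bool> done_;   // declared before thread_ so it is initialized first
    std::thread thread_;
};
```

The main thread can then call `alive()` periodically. If `body` throws, std::terminate is called, as for any std::thread whose function exits via an exception.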
If threads are unexpectedly dying in your program, it is toast. If you want fault isolation with recovery, use multiple processes (with shared memory) instead of, or in addition to, threads. On POSIX (and Win32 also) you can detect if the owner of a process-shared mutex died while holding that mutex and implement some "fsck-like" check and repair of the shared data to try to restore its invariants. (Obviously it helps you if the data structure is designed with recoverable transactions in mind.)
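On the POSIX side, the owner-died detection is done with a robust mutex. The sketch below is an illustration under assumptions (a Linux system with a glibc that provides pthread_mutexattr_setrobust; the function name `lock_after_owner_died` is hypothetical, and error checks on the attribute calls are omitted for brevity). A child process locks a process-shared robust mutex and exits without unlocking it; the parent's next lock attempt then reports EOWNERDEAD.

```cpp
#include <cerrno>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns the result of the parent's pthread_mutex_lock call (EOWNERDEAD is
// the expected outcome), or -1 if the setup failed.
int lock_after_owner_died() {
    // Anonymous shared mapping so parent and child see the same mutex.
    void* mem = mmap(nullptr, sizeof(pthread_mutex_t), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    pthread_mutex_t* mtx = static_cast<pthread_mutex_t*>(mem);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, sizeof(pthread_mutex_t));
        return -1;
    }
    if (pid == 0) {
        pthread_mutex_lock(mtx);
        _exit(0);                      // child dies while holding the mutex
    }
    int status = 0;
    waitpid(pid, &status, 0);

    int rc = pthread_mutex_lock(mtx);
    if (rc == EOWNERDEAD) {
        // The "fsck-like" check and repair of the shared data would go here.
        pthread_mutex_consistent(mtx); // mark the mutex usable again
    }
    if (rc == 0 || rc == EOWNERDEAD) pthread_mutex_unlock(mtx);

    pthread_mutex_destroy(mtx);
    munmap(mem, sizeof(pthread_mutex_t));
    return rc;
}
```

If pthread_mutex_consistent is not called before unlocking, later lock attempts fail with ENOTRECOVERABLE, which is how a failed repair is signalled to the other processes.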
On Win32 you can use Windows structured exception handling (SEH) to catch any kind of exception in a thread (for instance access violation, division by zero, ...). Using the Tool Help API you can obtain a list of the attached modules, and there are interfaces for reading the machine registers, faulting address, etc.
In POSIX you can do that with signal handling. Events like access violations and such deliver signals to the thread to which they pertain.
It doesn't seem realistic to code up these pieces into a recovery strategy that tries to keep a buggy program running.