Crash Handler in C++

I have a daemon utility which I need to run without crashing. I know I can register signal handlers and ignore all signals except SIGKILL, and I did that in my application.
My daemon is multithreaded, and I want to know: if a SIGABRT signal is raised by some code in a thread, will that thread exit? Or, if I ignore SIGABRT, will that thread continue running?
Let's say my app last crashed because of this error:
*** error for object 0x101800068: incorrect checksum for freed object - object was probably modified after being freed.
Can I keep my thread running if it doesn't exit, and would that create any issues?
I want my application to keep running no matter what. I want it to recover from the error, as if the process had been restarted. It would be even better if I could make all threads except main() exit when the crash signal arrives, and then restart them. But as far as I have noticed, the threads do not exit when these signals arrive. How can I get all my threads to exit on these signals so that I can restart them?

[too long for a comment]
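There are conditions where a thread is forced to go down (for example a fatal signal such as SIGSEGV or SIGABRT whose action is to terminate), and if one thread goes down that way, the whole program goes down with it. That's it.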
This is different for processes.
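A minimal POSIX sketch of both points (Linux assumed; the function name is hypothetical): inside a forked child, a worker thread sends itself SIGSEGV while the disposition is still the default, and the whole child process dies even though its main thread was only waiting in join(). The parent process survives and can see which signal killed the child.

```cpp
#include <pthread.h>
#include <signal.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

// Returns the signal that terminated the child, 0 if it exited normally,
// or -1 if fork() failed.
int signal_that_killed_child() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: only the worker thread receives SIGSEGV. With SIG_DFL the
        // action is "terminate the process", not "terminate the thread".
        std::thread worker([] { pthread_kill(pthread_self(), SIGSEGV); });
        worker.join();
        sleep(5);   // not reached: the whole child process is already gone
        _exit(0);   // reaching this would mean only the thread had died
    }
    // Parent: a separate process, so it is unaffected and can observe the crash.
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling `signal_that_killed_child()` should report `SIGSEGV`; the child's main thread never gets to `_exit(0)`.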
So one approach to building a more robust multitasking system would be to use processes instead of threads, with each process supervised by another process that restarts it when it crashes. The supervisor could of course also be supervised and restarted on crash, and that one in turn could also be ...
OK, perhaps it would be more efficient to write code that does not crash in the first place.
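The supervision idea can be sketched with fork() and waitpid(). This is a minimal illustration under stated assumptions (POSIX, a hypothetical `supervise` helper, restart only when the child was killed by a signal), not a production supervisor; a real one would add back-off, logging and, as noted, its own supervision.

```cpp
#include <cerrno>
#include <cstdlib>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `work(attempt)` in a forked child; its return value becomes the
// child's exit code. If the child dies from a signal (a crash), it is
// started again, up to `max_restarts` times.
// Returns the number of restarts performed, or -1 on a fork/wait error.
int supervise(const std::function<int(int)>& work, int max_restarts) {
    for (int attempt = 0;; ++attempt) {
        pid_t pid = fork();
        if (pid < 0) return -1;
        if (pid == 0) _exit(work(attempt));  // child: do the work, never return

        int status = 0;
        pid_t r;
        do {
            r = waitpid(pid, &status, 0);    // retry if interrupted by a signal
        } while (r < 0 && errno == EINTR);
        if (r < 0) return -1;

        // A normal exit ends supervision; so does running out of restarts.
        if (!WIFSIGNALED(status) || attempt == max_restarts) return attempt;
    }
}
```

For example, a `work` that calls `std::abort()` on its first two attempts and returns 0 on the third makes `supervise(work, 5)` return 2. The crashes never touch the supervisor's own memory, which is the property threads cannot give you.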

Related

How to safely terminate a multithreaded process

I am working on a project where we have used pthread_create to create several child threads.
The thread creation logic is not in my control, as it's implemented by another part of the project.
Each thread performs some operation which takes more than 30 seconds to complete.
Under normal condition the program works perfectly fine.
But the problem occurs at the time of termination of the program.
I need to exit from main as quickly as possible when I receive the SIGINT signal.
When I call exit() or return from main, the exit handlers and global objects' destructors are called, and I believe these operations race with the still-running threads. There seem to be many such race conditions, which makes them hard to solve one by one.
The way I see it there are two solutions.
1. Call _exit() and forget about de-allocating resources.
2. When SIGINT arrives, close/kill all threads and then call exit() from the main thread, which will release resources.
I think the first option will work, but I do not want to abruptly terminate the process.
So I want to know if it is possible to terminate all child threads as quickly as possible, so that the exit handlers and destructors can perform the required clean-up and terminate the program.
I have gone through this post; let me know if you know other ways: POSIX API call to list all the pthreads running in a process
Also, let me know if there is any other solution to this problem.
What is it that you need to do before the program quits? If the answer is 'deallocate resources', then you don't need to worry. If you call _exit then the program will exit immediately and the OS will clean up everything for you.
Be aware also that what you can safely do in a signal handler is extremely limited, so attempting to perform any cleanup yourself is not recommended. If you're interested, there's a list of what you can do here. But you can't flush a file to disk, for example (which is about the only thing I can think of that you might legitimately want to do here). That's off limits.
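As an illustration of staying within those limits, here is a sketch of a SIGINT handler that calls only async-signal-safe functions (write() and _exit()); exit() is not on that list because it runs atexit handlers and destructors. The handler and helper names, and the exit code 128 + SIGINT, are illustrative choices, not something the answer prescribes.

```cpp
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

// Only async-signal-safe calls in here: no printf, no exit(), no locks.
extern "C" void on_sigint(int) {
    static const char msg[] = "SIGINT received, exiting\n";
    ssize_t n = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)n;
    _exit(128 + SIGINT);  // skips atexit handlers and static destructors
}

bool install_sigint_exit() {
    struct sigaction sa {};
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGINT, &sa, nullptr) == 0;
}

// Demo: a forked child installs the handler and interrupts itself; the
// parent returns the child's exit code (or -1 if it did not exit normally).
int demo_exit_code() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        install_sigint_exit();
        raise(SIGINT);
        _exit(0);  // not reached: the handler exits first
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```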
I need to exit from main as quickly as possible when I receive the SIGINT signal.
How is that defined? There's no way to "exit as quickly as possible" when you receive one signal like that.
You can either set flag(s), post to semaphore(s), or similar to set a state that tells other threads it's time to shut down, or you can kill the entire process.
If you elect to set flag(s) or similar to tell the other threads to shut down, you set those flags and return from your signal handler and hope the threads behave and the process shuts down cleanly.
If you elect to kill threads, there's effectively no difference in killing a thread, killing the process, or calling _exit(). You might as well just keep it simple and call _exit().
That's all you can choose between when you have to make your decision in a single signal handler call. Pick one.
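A sketch of the first choice (set a flag, return, and let the threads wind down), assuming C++11 or later, POSIX sigaction(), and that std::atomic<bool> is lock-free on the target, which it is on mainstream platforms; C++ allows lock-free atomics to be used from signal handlers, and volatile sig_atomic_t is the classic C alternative. All names are hypothetical, and raise() stands in for the user pressing Ctrl+C.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <signal.h>
#include <thread>
#include <vector>

// Written by the handler, polled by the workers.
static std::atomic<bool> g_shutdown{false};

extern "C" void request_shutdown(int) { g_shutdown.store(true); }

// Worker: does its job in small slices and checks the flag between slices.
static void worker(std::atomic<int>& finished) {
    while (!g_shutdown.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));  // one slice of "work"
    }
    finished.fetch_add(1);  // per-thread clean-up would go here
}

// Starts n workers, delivers SIGINT, joins them; returns how many exited cleanly.
int run_until_sigint(int n) {
    struct sigaction sa {};
    sa.sa_handler = request_shutdown;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGINT, &sa, nullptr) != 0) return -1;

    std::atomic<int> finished{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) threads.emplace_back(worker, std::ref(finished));
    raise(SIGINT);                     // the handler only sets the flag and returns
    for (auto& t : threads) t.join();  // main thread waits, then can call exit()
    return finished.load();
}
```

The handler does nothing but store to the flag; everything that is not async-signal-safe (joining, destructors, exit()) happens in ordinary code afterwards.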
A better solution is to use escalating signals. For example, when you get SIGQUIT or SIGINT, you set flag(s) or otherwise tell threads it's time to clean up and exit the process - or else. Then, say five seconds later whatever is shutting down your process sends SIGTERM and the "or else" happens. When you get SIGTERM, your signal handler simply calls _exit() - those threads had their chance and they messed it up and that's their fault. Or you can call abort() to generate a core file and maybe provide enough evidence to fix the miscreant threads that won't shut down.
And finally, five seconds later the managing process will nuke the process from orbit with SIGKILL just to be sure.
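The managing side of that escalation can be sketched as follows (POSIX; the helper names and the grace period are hypothetical). It sends SIGINT, then SIGTERM, then SIGKILL, waiting up to a grace period after each of the first two, and reports which signal finally made the process exit.

```cpp
#include <chrono>
#include <initializer_list>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

// Polls for up to `grace`; returns true (and fills *status) if `pid` exited.
static bool wait_for_exit(pid_t pid, std::chrono::milliseconds grace, int* status) {
    const auto deadline = std::chrono::steady_clock::now() + grace;
    do {
        if (waitpid(pid, status, WNOHANG) == pid) return true;
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    } while (std::chrono::steady_clock::now() < deadline);
    return false;
}

// Ask nicely, then insist, then nuke. Returns the signal after which `pid` exited.
int escalate(pid_t pid, std::chrono::milliseconds grace) {
    int status = 0;
    for (int sig : {SIGINT, SIGTERM}) {
        kill(pid, sig);
        if (wait_for_exit(pid, grace, &status)) return sig;
    }
    kill(pid, SIGKILL);        // cannot be caught or ignored
    waitpid(pid, &status, 0);  // reap the child
    return SIGKILL;
}
```

With a grace of five seconds this follows the timeline described above. A child that ignores SIGINT but keeps the default SIGTERM action should make `escalate` return `SIGTERM`; one that ignores both ends up with `SIGKILL`.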

SIGSEGV does not terminate the process

I have a multithreaded application that runs 30-something threads. I know there is a bug where sometimes two threads attempt to sort one list simultaneously and this usually results in one of the threads accessing invalid memory. Thus, a SIGSEGV is generated for that thread.
Now, from what I understand about signals, SIGSEGV in the thread should get the SIG_DFL action, which is abnormal termination of the process with a core dump. However, I saw the process still alive (in a kind of limbo state): execution had halted, but the process was still alive. When I tried to kill it, SIGTERM was actually delivered to my custom signal handler (which attempts to shut down all the threads nicely), but it hung there, because none of the threads were actually executing anymore. I finally managed to kill it using SIGQUIT, and the core file was generated after that.
So my question is: what is meant by "abnormal termination"? How can a process not be removed from memory if SIGSEGV is handled by SIG_DFL? What could possibly be going on to cause such behaviour? My Linux is Red Hat Enterprise Linux Server release 5.11 (Tikanga).
EDIT: I know (more or less) how to debug it, and I even know what the bug is. My question is more or less: what exactly does Red Hat do when SIGSEGV is handled with SIG_DFL? The problem here is that the process was not responding but was not dead either, so the automatic restart procedure did not kick in and we had some unpleasant downtime.
The situation was not as straightforward as I thought. The original problem was a deadlock between two of the threads. When I issued SIGTERM, it was actually my custom signal handler that caused a segfault in the (now un-)deadlocked threads.

Terminating Qt worker thread during program shutdown

I use Qt 4.8.6, MS Visual Studio 2008 and Windows 7. I've created a GUI program. It contains the main GUI thread and a worker thread (I have not made a QThread subclass, by the way), which makes synchronous calls to 3rd-party DLL functions. These functions are rather slow. A QTcpServer instance also lives in the worker thread. My worker class contains the QTcpServer and the DLL wrapper methods.
I know that quit() is preferred over terminate(), but I don't want to wait for a minute (because of the slow DLL functions) during program shutdown. When I try to terminate() the worker thread, I get warnings about stopping QTcpServer from another thread. What is the correct way to shut the process down?
QThread::quit tells the thread's event loop to exit. After calling it, the thread will finish as soon as control returns to the thread's event loop.
You may also force a thread to terminate right now via QThread::terminate(), but this is a very bad practice, because it may terminate the thread at an undefined position in its code, which means you may end up with resources never getting freed up and other nasty stuff. So use this only if you really can't get around it.
So I think the right approach is to first tell the thread to quit normally, and if something goes wrong, it takes too long and you have no way to wait for it, then terminate it:
QThread *th = myWorkerObject->thread();
th->quit();                // ask the thread's event loop to exit
th->wait(5000);            // wait up to 5 seconds for it to finish
if (th->isRunning()) {     // it took longer than usual, so force it
    th->terminate();
    th->wait();            // Qt docs: call wait() after terminate() to synchronize
}
You should always try to avoid killing threads from the outside by force and instead ask them nicely to finish what they're doing. This usually means that the thread checks regularly if it should terminate itself and the outside world tells it to terminate when needed (by setting a flag, signaling an event or whatever is appropriate for the situation at hand).
When a thread is asked to terminate itself, it finishes up what it's doing and exits cleanly. The application waits for the thread to terminate and then exits.
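One portable way to implement "signaling an event" in standard C++ is a flag guarded by a mutex plus a condition variable. The sketch below (hypothetical class and function names) assumes a worker that performs a unit of work every `interval`; because it waits on the event instead of calling sleep_for, set() wakes it immediately rather than after the full interval.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// "Please stop" event: set by the owner, waited on by the worker.
class StopEvent {
public:
    void set() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
    }
    // Waits up to `d`; returns true as soon as stop has been requested.
    bool wait_for(std::chrono::milliseconds d) {
        std::unique_lock<std::mutex> lock(m_);
        return cv_.wait_for(lock, d, [this] { return stop_; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
};

// Periodic worker: one unit of work per interval, exits cleanly when asked.
void worker_loop(StopEvent& stop, std::chrono::milliseconds interval, int& units_done) {
    while (!stop.wait_for(interval)) {
        ++units_done;  // placeholder for one unit of real work
    }
    // clean-up (close sockets, flush files, ...) would go here
}
```

Usage: start `std::thread t(worker_loop, std::ref(stop), interval, std::ref(units))`, and at shutdown call `stop.set()` followed by `t.join()`. Note that this only helps between units of work; a single slow blocking call (such as the DLL functions in the question) still has to return before the flag is seen.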
You say that in your case the thread takes a long time to finish. You can take this into consideration and still terminate the thread "the nice way" (for example you can hide the application window and give the impression that the app has exited, even if the process takes a little more time until it finally terminates; or you can show some form of progress indication to the user telling him that the application is shutting down).
Unless there is an overriding reason to do so, you should not attempt to terminate threads with user code at process-termination.
If there is no such reason, just call your OS process-termination syscall, e.g. ExitProcess(0). The OS can, and will, stop all process threads in any state before releasing all process resources. User code cannot do that, and should not try to terminate threads, or signal them to self-terminate, unless absolutely necessary.
Attempting to 'clean up' with user code sounds 'nice' (apparently), but it is an expensive luxury that you will pay for with extra code, extra testing and extra maintenance.
That is, if your customers don't stop buying your app because they get pissed off with it taking so long to shut down.
The OS is very good at stopping threads and cleaning up. It's had endless thousands of hours of testing during development and decades of life in the wild, where problems with process termination would have become apparent and been fixed. You will not even get close to that with your flags, events etc. as you struggle to stop threads running on another core without the benefit of an interprocessor driver.
There are surely times when you will have to resort to user code to stop threads. If you need to stop them before process termination, or you need to close some DB connection, flush some file at shutdown, deal with interprocess comms or the like issues, then you will have to resort to some of the approaches already suggested in other answers.
If not, don't try to duplicate OS functionality in the name of 'niceness'. Just ask it to terminate your process. You can get your warm, fuzzy feeling when your app shuts down immediately while other developers are still struggling to implement 'Shutdown' progress bars or trying to explain to customers why they have 15 zombie apps still running.

Handling Signals in an MPI Application / Gracefully exit

How can signals be handled safely in an MPI application (for example SIGUSR1, which should tell the application that its runtime has expired and that it should terminate within the next 10 minutes)?
I have several constraints:
Finish all parallel/serial IO first before quitting the application!
In all other circumstances the application can exit without any problem
How can this be achieved safely, without deadlocks while trying to exit, properly leaving the current context, jumping back to main() and calling MPI_FINALIZE()?
Somehow the processes have to agree on exiting (I think this is the same in multithreaded applications), but how is that done efficiently without having to communicate too much? Is anybody aware of a standard way of doing this properly?
Below are some thought which might or might not work:
Idea 1:
Let's say each process catches the signal in a signal handler, pushes it onto an "unhandled signals stack" (USS) and simply returns from the signal handler routine. We then have certain termination points in our application, especially before and after IO operations, which handle all signals in the USS.
If there is a SIGUSR1 in USS for example, each process would then exit at a termination point.
This idea has the problem that there could still be deadlocks: process 1 catches a signal just before a termination point, while process 2 has already passed this point and is now starting parallel IO. Process 1 would exit, which results in a deadlock in process 2 (waiting for IO on process 1, which has exited).
Idea 2:
Only the master process 0 catches the signal in the signal handler and then sends a broadcast message "all processes exit!" at a specific point in the application. All processes receive the broadcast and throw an exception, which is caught in main, where MPI_FINALIZE is called.
This way the exit happens safely, but at the cost of having to continuously receive broadcast messages to see whether we should exit or not.
Thanks a lot!
If your goal is to stop all processes at the same point, then there is no way around always synchronizing at the possible termination points. That is, a collective call at the termination points is required.
Of course, you can try to avoid an extra broadcast by using the synchronization of another collective call to ensure proper termination, or piggyback the termination information on an existing broadcast, but I don't think that's worth it. After all, you only need to synchronize before I/O and at least once every ten minutes. At such a frequency, even a broadcast is not a performance problem.
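In MPI the collective at a termination point would typically be an MPI_Allreduce of each rank's local stop flag with the MPI_LOR operation. To keep the sketch runnable without an MPI installation, the example below uses threads as stand-ins for ranks (the question notes the problem is the same in multithreaded programs) and a hypothetical `OrAllReduce` helper in place of MPI_Allreduce. What carries over is the structure: every participant calls the collective at the same point, so all of them get the same answer and leave together, before any collective I/O starts.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for MPI_Allreduce(..., MPI_LOR, ...) among n threads.
class OrAllReduce {
public:
    explicit OrAllReduce(int n) : n_(n) {}

    // Every participant must call this at the same termination point. Blocks
    // until all n have arrived; returns true to all if any passed true.
    bool reduce(bool local_stop) {
        std::unique_lock<std::mutex> lock(m_);
        const int gen = generation_;
        accum_ = accum_ || local_stop;
        if (++arrived_ == n_) {  // last arrival publishes the result
            result_ = accum_;
            accum_ = false;
            arrived_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation_ != gen; });
        }
        return result_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    const int n_;
    int arrived_ = 0;
    int generation_ = 0;
    bool accum_ = false;
    bool result_ = false;
};

// Rank 0 "receives SIGUSR1" at iteration `stop_at` (its handler would only
// set a local flag). Returns the iteration at which each rank stopped.
std::vector<int> run_ranks(int n, int stop_at) {
    OrAllReduce termination_point(n);
    std::vector<int> stopped_at(n, -1);
    std::vector<std::thread> ranks;
    for (int r = 0; r < n; ++r) {
        ranks.emplace_back([&, r] {
            for (int iter = 0;; ++iter) {
                // ... compute step ...
                const bool local_stop = (r == 0 && iter >= stop_at);
                if (termination_point.reduce(local_stop)) {
                    stopped_at[r] = iter;  // everyone leaves at the same point
                    return;
                }
                // ... collective I/O here is safe: every rank is still present ...
            }
        });
    }
    for (auto& t : ranks) t.join();
    return stopped_at;
}
```

`run_ranks(4, 3)` should report iteration 3 for every rank, including the ranks that never saw the signal themselves, which is exactly the agreement Idea 1 lacks.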
Using signals in your MPI application in general is not safe. Some implementations may support it and others may not.
For instance, in MPICH, SIGUSR1 is used by the process manager for internal notification of abnormal failures.
http://lists.mpich.org/pipermail/discuss/2014-October/003242.html
Open MPI, on the other hand, will forward SIGUSR1 and SIGUSR2 from mpiexec to the other processes.
http://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php#sect14
Other implementations will differ. So before you go too far down this route, make sure that the implementation you're using can deal with it.

How to make a new thread and terminate it after some time has elapsed?

The deal is:
I want to create a thread that works similarly to executing a new .exe in Windows, so that if that program (the new thread) crashes or goes into an infinite loop, it will be killed gracefully (after the time limit is exceeded or when it crashes) and all resources will be freed properly.
And when that thread has succeeded, I would like to be able to modify some global variable which could hold some data, such as a list of files. That is why I can't just execute an external executable from Windows, since I can't access the variables inside the function that was executed in the new thread.
Edit: Clarified the problem a lot more.
The thread will already run after calling CreateThread.
WaitForSingleObject is not necessary (unless you really want to wait for the thread to finish); but it will not "force-quit" the thread; in fact, force-quitting - even if it might be possible - is never such a good idea; you might e.g. leave resources opened or otherwise leave your application in a state which is no good.
A thread is not some sort of magical object that can be made to do things. It is a separate path of execution through your code. Your code cannot be made to jump arbitrarily around its codebase unless you specifically program it to do so. And even then, it can only be done within the rules of C++ (ie: calling functions).
You cannot kill a thread because killing a thread would utterly wreck some of the most fundamental assumptions a programmer makes. You would now have to take into account the possibility that the next line doesn't execute for reasons that you can neither predict nor prevent.
This isn't like exception handling, where C++ specifically requires destructors to be called, and you have the ability to catch exceptions and do special cleanup. You're talking about executing one piece of code, then suddenly ending the execution of that entire call-stack. That's not going to work.
The reason that web browsers moved from a "thread-per-tab" to "process-per-tab" model is exactly this: because processes can be terminated without leaving the other processes in an unknown state. What you need is to use processes instead of threads.
When the process finishes and sets its data, you need to use some inter-process communication mechanism to read that data (I like Boost.Interprocess myself). It won't look like a regular C++ global variable, but you shouldn't have a problem with reading it. This way, you can effectively kill the process if it's taking too long, and your program will remain in a reasonable state.
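A POSIX sketch of the process-based approach, with a plain pipe standing in for Boost.Interprocess (a deliberate swap to keep the example dependency-free). The `run_isolated` helper is hypothetical: it runs a job in a forked child, reads the child's result (for example a list of file names) from the pipe, and kills the child with SIGKILL if it exceeds the time limit. A crash or hang in the child never corrupts the parent's state.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <optional>
#include <poll.h>
#include <signal.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns the job's output, or std::nullopt on timeout, crash or setup failure.
std::optional<std::string> run_isolated(const std::function<std::string()>& job,
                                        std::chrono::milliseconds timeout) {
    int fds[2];
    if (pipe(fds) != 0) return std::nullopt;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return std::nullopt;
    }
    if (pid == 0) {  // child: compute, send the result through the pipe, exit
        close(fds[0]);
        const std::string out = job();
        ssize_t n = write(fds[1], out.data(), out.size());
        _exit(n == static_cast<ssize_t>(out.size()) ? 0 : 1);
    }
    close(fds[1]);  // parent keeps only the read end
    std::string result;
    bool timed_out = false;
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    char buf[4096];
    for (;;) {
        const auto left = std::chrono::duration_cast<std::chrono::milliseconds>(
            deadline - std::chrono::steady_clock::now());
        pollfd p{fds[0], POLLIN, 0};
        if (left.count() <= 0 || poll(&p, 1, static_cast<int>(left.count())) <= 0) {
            timed_out = true;  // too slow (or poll failed): kill the child
            kill(pid, SIGKILL);
            break;
        }
        const ssize_t n = read(fds[0], buf, sizeof buf);
        if (n <= 0) break;  // EOF: the child closed its end of the pipe
        result.append(buf, static_cast<std::size_t>(n));
    }
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);  // reap the child in every case
    if (!timed_out && WIFEXITED(status) && WEXITSTATUS(status) == 0) return result;
    return std::nullopt;
}
```

A job returning `"a.txt\nb.txt\n"` within the limit yields that string; a job that never returns yields an empty optional after roughly the timeout.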
Well, that's what WaitForSingleObject does. It blocks until the object does something (in case of a thread it waits until the thread exits or the timeout elapses). What you need is
HANDLE thread = CreateThread(NULL, 0, do_stuff, NULL, 0, NULL);
// rest of the code, which runs in parallel with your new thread
WaitForSingleObject(thread, 4000); // wait up to 4 seconds for the other thread to exit
CloseHandle(thread);               // releases the handle; does not stop the thread
If you want your worker thread to shut down after a period of time has elapsed, the best way to do that is to have the thread itself monitor the elapsed time in some way and then exit when the time is up.
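A minimal sketch of a worker that enforces its own time limit, assuming its work can be split into small units (the function and its parameters are hypothetical). It checks a std::chrono::steady_clock deadline between units and returns early once the budget is spent, so nothing has to be killed from outside.

```cpp
#include <chrono>
#include <thread>

// Returns true if all units finished, false if the budget ran out first.
bool run_with_budget(int total_units, std::chrono::milliseconds unit_cost,
                     std::chrono::milliseconds budget) {
    const auto deadline = std::chrono::steady_clock::now() + budget;
    for (int i = 0; i < total_units; ++i) {
        if (std::chrono::steady_clock::now() >= deadline) return false;  // time is up
        std::this_thread::sleep_for(unit_cost);  // placeholder for one unit of real work
    }
    return true;
}
```

Run it as the body of the worker thread; the thread then ends on its own, and the main thread only needs to join it. The check is only as fine-grained as the units: a single unit that never returns (an infinite loop) still needs the process-based approach above.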
Another way to do this is to monitor the elapsed time in the main thread or even a third, monitor-type thread. When the time has elapsed, set an event. Your worker thread could wait for this event in its main loop, and then exit when it has been raised. These kinds of events, which are used to signal the thread to kill itself, are sometimes called "death events." (Or at least, I call them that.)
Yet another way to do this is to queue a user APC (QueueUserAPC) to the worker thread, which needs to be in an alertable wait state. The APC can then set some internal state variable which will trigger the death sequence in the thread when it resumes.
There is another method which I hesitate even mentioning, because it should only be used in extremely dire circumstances. You can kill the thread. This is a very dangerous method akin to turning off your sink by detonating an atomic bomb. You get the sink turned off, but there could be other unintended consequences as well. Please don't do this unless you know exactly what you're doing and why.
Remove the call to WaitForSingleObject. That causes your parent thread to wait.
Remove the WaitForSingleObject call?