C++ graceful shutdown best practices

I'm writing a multi-threaded c++ application for *nix operating systems. What are some best practices for terminating such an application gracefully? My instinct is that I'd want to install a signal handler on SIGINT (SIGTERM?) which stops/joins my threads. Also, is it possible to "guarantee" that all destructors are called (provided no other errors or exceptions are thrown while handling the signal)?

Some considerations come to mind:
designate one thread to be responsible for orchestrating the shutdown; e.g., as Dithermaster suggested, this could be the main thread if you are writing a standalone application. Or, if you are writing a library, provide an interface (e.g., a function call) through which a client program can tear down the objects created within the library.
you cannot guarantee destructors are called; that is up to you, and requires carefully matching every new with a delete. Smart pointers may help you. But, really, this is a design consideration: the major components should have start and stop semantics, which you could choose to invoke from the class constructor and destructor (see the sketch after this list).
the shutdown sequence for a set of interacting objects can take some effort to get correct. For example, before you delete an object, are you sure some timer mechanism is not going to call it a few microseconds or milliseconds later? Trial and error is your friend here; develop a framework that can repeatedly and rapidly start and stop your application to tease out shutdown-related race conditions.
signals are one way to trigger an event; others might be periodically polling for a known file, or opening a socket and receiving some data on it. Either way, you want to decouple the shutdown sequence code from the trigger event.
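To make the start/stop point concrete, here is a minimal sketch (the Worker name and its internals are purely illustrative): a component whose constructor and destructor simply delegate to explicit start() and stop() members, so it can also be driven without RAII if needed.

#include <atomic>
#include <thread>

class Worker {
public:
    Worker()  { start(); }            // one choice: start from the constructor
    ~Worker() { stop(); }             // and stop from the destructor

    void start() {
        running_ = true;
        thread_ = std::thread([this] {
            while (running_.load(std::memory_order_relaxed)) {
                // ... do one bounded unit of work ...
            }
        });
    }

    void stop() {
        running_ = false;             // ask the loop to finish
        if (thread_.joinable())
            thread_.join();           // wait, so everything it owns is destroyed safely
    }

private:
    std::atomic<bool> running_{false};
    std::thread thread_;
};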

My recommendation is that the main thread shut down all worker threads before exiting itself. Send each worker an event telling it to clean up and exit, and wait for each one to do so. This will allow all C++ destructors to run.
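A hedged sketch of that approach, assuming a simple shutdown event shared between the main thread and the workers (the ShutdownEvent type and its members are invented for illustration; real workers would also be waiting for work items, not only for shutdown):

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct ShutdownEvent {
    std::mutex m;
    std::condition_variable cv;
    bool requested = false;

    void request() {                          // called by the main thread
        { std::lock_guard<std::mutex> lk(m); requested = true; }
        cv.notify_all();
    }
    void wait() {                             // workers block here between jobs
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return requested; });
    }
};

int main() {
    ShutdownEvent shutdown;
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&] {
            shutdown.wait();                  // simplified: just wait for the event
            // per-thread cleanup happens here; destructors run on return
        });

    // ... run the application ...

    shutdown.request();                       // tell every worker to exit
    for (auto& t : workers) t.join();         // wait for them, so all destructors run
}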

Regarding signal management, the only thing you can portably and safely do inside a signal handler is to write to a variable of type volatile sig_atomic_t and return. In general, you cannot call most functions and must not write to global memory. In other words, the handler should just set a flag to be tested inside your main routine, at some point you find appropriate, and the action resulting from the signal itself should be performed from there.
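A minimal sketch of that pattern, assuming SIGINT and SIGTERM are the signals of interest:

#include <csignal>

static volatile std::sig_atomic_t g_stop_requested = 0;

void handle_stop_signal(int) {
    g_stop_requested = 1;        // no locks, no I/O, no allocation inside the handler
}

int main() {
    std::signal(SIGINT,  handle_stop_signal);
    std::signal(SIGTERM, handle_stop_signal);

    while (!g_stop_requested) {
        // ... normal work; the flag is tested at a point we find appropriate ...
    }
    // the orderly shutdown happens here, outside the handler
}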
(Since there might be blocking I/O involved, consider studying POSIX Thread Cancellation. Your Unix clone (most notably Linux) might have peculiarities with respect to this and to the above.)
Regarding destructors, no magic is involved. They will be executed if control leaves a given scope through any means defined in the language (a return, falling off the end of the scope, or an exception propagating out). Leaving a scope through other means (for example, longjmp() or even exit()) does not run the destructors of the automatic objects in that scope.
Regarding general shutdown practices, there are divergent opinions on the field.
Some state that a "graceful termination", in the sense of releasing every resource ever allocated, should be performed. In C++, this usually means that all destructors should be properly executed before the process terminates. This is tricky in practice and often a source of much grief, especially in multithreaded programs, for a variety of reasons. Signals further complicate things by the very nature of asynchronous signal dispatching.
Because most of this work is useless, others, like me, contend that the program should just terminate immediately, possibly shortly after undoing persistent changes to the system (such as removing temporary files or restoring the screen resolution) and saving configuration. An apparently tidier cleanup is not only a waste of time (the operating system will clean up most things, such as allocated memory, dangling threads, and open file descriptors), but can be a serious waste of time (deallocators might touch paged-out memory, uselessly forcing the system to page it in just to release it soon after the process terminates), not to mention the possibility of deadlocks arising from joining threads.
Just say no. When you want to leave, call exit() (or even _exit(), but watch out for unflushed I/O) and that's it. More annoying than slow-starting programs are slow-terminating programs.

Related

Cancelling arbitrary jobs running in a thread_pool

Is there a way for a thread-pool to cancel a task underway? Better yet, is there a safe alternative for on-demand cancelling opaque function calls in thread_pools?
Killing the entire process is a bad idea and using native handle to perform pthread_cancel or similar API is a last resort only.
Extra
Bonus if the cancellation is immediate, but it's acceptable if the cancellation has some time constraint 'guarantees' (say cancellation within 0.1 execution seconds of the thread in question for example)
More details
I am not restricted to using Boost.Thread's thread_pool or any specific library. The only limitations are compatibility with C++14 and the ability to work on at least BSD- and Linux-based operating systems.
The tasks are usually data-processing related, pre-compiled and loaded dynamically using C-API (extern "C") and thus are opaque entities. The aim is to perform compute intensive tasks with an option to cancel them when the user sends interrupts.
While launching, the thread_id for a specific task is known, and thus some API can be used to find more details if required.
Disclaimer
I know using native thread handles to cancel/exit threads is not recommended and is a sign of bad design. I also can't modify the functions to use boost::this_thread::interruption_point(), but I can wrap them in lambdas or other constructs if that helps. I feel like this is a rock-and-a-hard-place situation, so alternative suggestions are welcome, but they need to be minimally intrusive on existing functionality, and they can be dramatic in their scope for the feature set being discussed.
EDIT:
Clarification
I guess this should have gone in the 'More Details' section, but I want it to remain separate to show that the existing two answers are based on limited information. After reading the answers, I went back to the drawing board and came up with the following "constraints", since the question I posed was overly generic. If I should post a new question, please let me know.
My interface promises a "const" input (functional programming style non-mutable input) by using mutexes/copy-by-value as needed and passing by const& (and expecting thread to behave well).
I also mis-used the term "arbitrary" since the jobs aren't arbitrary (empirically speaking) and have the following constraints:
some which download from "internet" already use a "condition variable"
not violate const correctness
can spawn other threads, but they must not outlast the parent
can use mutex, but those can't exist outside the function body
output is via atomic<shared_ptr> passed as argument
pure functions (no shared state with outside) **
** can be a lambda binding a functor, in which case the function needs to make sure its data structures aren't corrupted (which is the case since, usually, the state is one or two atomic<built-in type>). Usually the internal state is queried from an external db (an architecture similar to cookie + web server, where the tab/browser can be closed at any time)
These constraints aren't written down as a contract or anything, but rather I generalized based on the "modules" currently in use. The jobs are arbitrary in terms of what they can do: GPU/CPU/internet all are fair play.
It is infeasible to insert a periodic check because of heavy library usage. The libraries (not owned by us) haven't been designed to periodically check a condition variable, since that would incur a performance penalty in the general case, and rewriting the libraries is not possible.
Is there a way for a thread-pool to cancel a task underway?
Not at that level of generality, no, and also not if the task running in the thread is implemented natively and arbitrarily in C or C++. You cannot terminate a running task prior to its completion without terminating its whole thread, except with the cooperation of the task.
Better yet, is there a safe alternative for on-demand cancelling opaque function calls in thread_pools?
No. The only way to get (approximately) on-demand preemption of a specific thread is to deliver a signal to it (one that it is not blocking or ignoring) via pthread_kill(). If such a signal terminates the thread but not the whole process, then it does not automatically make any provision for freeing allocated objects or managing the state of mutexes or other synchronization objects. If the signal does not terminate the thread, then the interruption can produce surprising and unwanted effects in code not designed to accommodate such signal usage.
Killing the entire process is a bad idea and using native handle to perform pthread_cancel or similar API is a last resort only.
Note that a thread can disable or defer its own cancellation (see pthread_setcancelstate()), and that even when cancellation is enabled, its effects may be deferred indefinitely. When the effects do occur, they do not necessarily include memory or synchronization-object cleanup. You need the thread to cooperate with its own cancellation to achieve these.
Just what a thread's cooperation with cancellation looks like depends in part on the details of the cancellation mechanism you choose.
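For illustration only, the simplest form of cooperation is a shared flag that the task polls between bounded chunks of work. Note that the question states the opaque library calls cannot be modified to poll, so this is a sketch of what cooperation would look like, not a drop-in answer:

#include <atomic>
#include <thread>

void run_task(const std::atomic<bool>& cancelled) {   // hypothetical cooperative task
    while (!cancelled.load(std::memory_order_relaxed)) {
        // ... do one bounded chunk of work, then re-check the flag ...
    }
    // normal return: destructors run and any held mutexes are released
}

int main() {
    std::atomic<bool> cancelled{false};
    std::thread worker([&] { run_task(cancelled); });

    // ... later, when the user sends an interrupt ...
    cancelled = true;    // request cancellation
    worker.join();       // the task exits at its next check
}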
Cancelling a non-cooperative component that was not designed to be cancelled is only possible if that component has limited, constrained, managed interactions with the rest of the system:
the resources owned by the component should be managed externally (the system knows which component uses which resources)
all accesses should be indirect
modifications of shared resources should be safe and reversible until completion
That would allow the system to clean up resources, stop operations, and cancel incomplete changes.
None of these properties are cheap; all the properties of threads are the exact opposite of these properties.
Threads only have an implied concept of ownership, visible only while the thread is running: for a thread that has been killed, determining what it owned is not possible.
Threads access shared objects directly. A thread can begin modifying a shared object; if it is stopped in the middle of that operation, the modification is left partial, ineffective, or incoherent.
Cancelled threads could leave locked mutexes behind. At a minimum, subsequent attempts by other threads to take those mutexes in order to reach the shared object would deadlock.
Or they might find some data structure in a bad state.
Providing safe cancellation for arbitrary, non-cooperative threads is not doable, even with very large-scale changes to the thread synchronization objects, and not even with a complete redesign of the thread primitives.
You would have to make threads almost like full processes to be able to do that; but then they wouldn't be called threads!

Terminating QWebSocketServer with connected sockets

I am debugging a multithreaded console application written in C++/Qt 5.12.1, running on Linux Mint 18.3 x64.
The app has a SIGINT handler, a QWebSocketServer, and a table of QWebSocket instances. To handle termination it calls close() on the QWebSocketServer and abort()/deleteLater() on the items in the QWebSocket table.
If a websocket client connects to this console app, termination fails because of some still-running thread (I suppose it's an internal QWebSocket thread).
Termination is successful if there were no connections.
How can I fix this so that the app exits gracefully?
To quit the socket server gracefully, we can attempt the following:
The most important part is to allow the main thread event loop to run and wait on QWebSocketServer::closed() so that the slot calls QCoreApplication::quit().
If we don't need a more detailed reaction, that can be done simply with:
connect(webSocketServer, &QWebSocketServer::closed,
        QCoreApplication::instance(), &QCoreApplication::quit);
After connecting that signal, and before anything else, call pauseAccepting() to prevent further connections.
Then call QWebSocketServer::close().
The steps below may not be needed if the above is sufficient. Try the above first, and only deal with existing and pending connections if you still have problems. In my experience the behavior varied across platforms and with some unusual websocket implementations in the server environment (which in your case is likely just Qt).
As long as we have some array of QWebSocket instances, we can call QWebSocket::abort() on all of them to release them immediately. This step seems to be what the question author already describes.
Try iterating over pending connections with QWebSocketServer::nextPendingConnection() and calling abort() on them; call deleteLater() as well, if that works.
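Putting those steps together, a hedged sketch might look like the following (webSocketServer and clients stand for the server instance and the QWebSocket table from the question; error handling is omitted):

#include <QtWebSockets/QWebSocketServer>
#include <QtWebSockets/QWebSocket>
#include <QCoreApplication>
#include <QList>

void beginGracefulShutdown(QWebSocketServer* webSocketServer,
                           const QList<QWebSocket*>& clients) {
    // 1. Quit the event loop once the server reports it has closed.
    QObject::connect(webSocketServer, &QWebSocketServer::closed,
                     QCoreApplication::instance(), &QCoreApplication::quit);

    // 2. Stop accepting new connections, then close the server.
    webSocketServer->pauseAccepting();
    webSocketServer->close();

    // 3. Only if needed: drop existing and pending connections.
    for (QWebSocket* socket : clients)
        socket->abort();
    while (QWebSocket* pending = webSocketServer->nextPendingConnection()) {
        pending->abort();
        pending->deleteLater();
    }
}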
There is no need to do anything. What do you mean by "graceful exit"? As soon as there's a request to terminate your application, you should terminate it immediately using exit(0) or a similar mechanism. That's what "graceful exit" should be.
Note: I got reformed. I used to think that graceful exits were a good thing. They are usually a waste of CPU resources and usually indicate problems in the architecture of the application.
A good rationale for why it should be this way is written up in the KJ framework (part of Cap'n Proto).
Quoting Kenton Varda:
KJ_NORETURN(virtual void exit()) = 0;
Indicates program completion. The program is considered successful unless error() was
called. Typically this exits with _Exit(), meaning that the stack is not unwound, buffers
are not flushed, etc. -- it is the responsibility of the caller to flush any buffers that
matter. However, an alternate context implementation e.g. for unit testing purposes could
choose to throw an exception instead.
At first this approach may sound crazy. Isn't it much better to shut down cleanly? What if
you lose data? However, it turns out that if you look at each common class of program, _Exit()
is almost always preferable. Let's break it down:
Commands: A typical program you might run from the command line is single-threaded and
exits quickly and deterministically. Commands often use buffered I/O and need to flush
those buffers before exit. However, most of the work performed by destructors is not
flushing buffers, but rather freeing up memory, placing objects into freelists, and closing
file descriptors. All of this is irrelevant if the process is about to exit anyway, and
for a command that runs quickly, time wasted freeing heap space may make a real difference
in the overall runtime of a script. Meanwhile, it is usually easy to determine exactly what
resources need to be flushed before exit, and easy to tell if they are not being flushed
(because the command fails to produce the expected output). Therefore, it is reasonably
easy for commands to explicitly ensure all output is flushed before exiting, and it is
probably a good idea for them to do so anyway, because write failures should be detected
and handled. For commands, a good strategy is to allocate any objects that require clean
destruction on the stack, and allow them to go out of scope before the command exits.
Meanwhile, any resources which do not need to be cleaned up should be allocated as members
of the command's main class, whose destructor normally will not be called.
Interactive apps: Programs that interact with the user (whether they be graphical apps
with windows or console-based apps like emacs) generally exit only when the user asks them
to. Such applications may store large data structures in memory which need to be synced
to disk, such as documents or user preferences. However, relying on stack unwind or global
destructors as the mechanism for ensuring such syncing occurs is probably wrong. First of
all, it's 2013, and applications ought to be actively syncing changes to non-volatile
storage the moment those changes are made. Applications can crash at any time and a crash
should never lose data that is more than half a second old. Meanwhile, if a user actually
does try to close an application while unsaved changes exist, the application UI should
prompt the user to decide what to do. Such a UI mechanism is obviously too high level to
be implemented via destructors, so KJ's use of _Exit() shouldn't make a difference here.
Servers: A good server is fault-tolerant, prepared for the possibility that at any time
it could crash, the OS could decide to kill it off, or the machine it is running on could
just die. So, using _Exit() should be no problem. In fact, servers generally never even
call exit anyway; they are killed externally.
Batch jobs: A long-running batch job is something between a command and a server. It
probably knows exactly what needs to be flushed before exiting, and it probably should be
fault-tolerant.

How to release resources if a program crashes

I have a program that uses services from others. If the program crashes, what is the best way to close those services? On the server side, I would define some checkers that periodically monitor whether a client has become invalid. But can we do anything on the client side? I am not sure whether normal RAII still works in this case. My code is written in C and C++.
If your application experiences a hard crash, then no, your carefully crafted cleanup code will not run, whether it is part of an RAII paradigm or a method you call at the end of main. None of an application's cleanup code runs after a crash that causes the application to be terminated.
Of course, this is not true for exceptions. Although those might eventually cause the application to be terminated, they still trigger this termination in a controlled way. Generally, the runtime library will catch an unhandled exception and trigger termination. Along the way, your RAII-based cleanup code will be executed, unless it also throws an exception. Then you're back to being unceremoniously ripped out of memory.
But even if your application's cleanup code can't run, the operating system will still attempt to clean up after you. This solves the problem of unreleased memory, handles, and other system objects. In general, if you crash, you need not worry about releasing these things. Your application's state is inconsistent, so trying to execute a bunch of cleanup code will just lead to unpredictable and potentially erroneous behavior, not to mention wasting a bunch of time. Just crash and let the system deal with your mess. As Raymond Chen puts it:
The building is being demolished. Don't bother sweeping the floor and emptying the trash cans and erasing the whiteboards. And don't line up at the exit to the building so everybody can move their in/out magnet to out. All you're doing is making the demolition team wait for you to finish these pointless housecleaning tasks.
Do what you must; skip everything else.
The only problem with this approach is, as you suggest in this question, when you're managing resources that are not controlled by the operating system, such as a remote resource on another system. In that case, there is very little you can do. The best scenario is to make your application as robust as possible so that it doesn't crash, but even that is not a perfect solution. Consider what happens when the power is lost, e.g. because a user's cat pulled the cord from the wall. No cleanup code could possibly run then, so even if your application never crashes, there may be termination events that are outside of your control. Therefore, your external resources must be robust in the event of failure. Time-outs are a standard method, and a much better solution than polling.
Another possible solution, depending on the particular use case, is to run consistency-checking and cleanup code at application initialization. This might be something that you would do for a service that is intended to run continuously and will be restarted promptly after termination. The next time it restarts, it checks its data and/or external resources for consistency, releases and/or re-initializes them as necessary, and then continues on as normal. Obviously this is a bad solution for a typical application, because there is no guarantee that the user will relaunch it in a timely manner.
As the other answers make clear, hoping to clean up after an uncontrolled crash (i.e., a failure which doesn't trigger the C++ exception unwind mechanism) is probably a path to nowhere. Even if you cover some cases, there will be other cases that fail and you are building in a serious vulnerability to those cases.
You mention that the source of the crashes is that you are "us[ing] services from others". I take this to mean that you are running untrusted code in-process, which is the potential source of crashes. In that case, you might consider running the untrusted code out of process and communicating back to your main process through a pipe, shared memory, or similar. Then you isolate the crashes to this child process and can do controlled cleanup in your main process. A separate process is really the lightest-weight thing that gives you the strong isolation you need to avoid corruption in the calling code.
If forking a process per-call is performance-prohibitive, you can try to keep the child process alive for multiple calls.
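A minimal sketch of the out-of-process idea using fork()/waitpid(); run_untrusted_call() and release_remote_resources() are hypothetical placeholders for the third-party call and the server-side cleanup:

#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>

extern void run_untrusted_call();        // hypothetical: the risky third-party code
extern void release_remote_resources();  // hypothetical: tell the server to clean up

bool call_in_child_process() {
    pid_t pid = fork();
    if (pid < 0)
        return false;                    // fork failed
    if (pid == 0) {                      // child: run the risky code in isolation
        run_untrusted_call();
        _exit(EXIT_SUCCESS);
    }
    int status = 0;
    waitpid(pid, &status, 0);            // parent: wait and inspect the outcome
    bool crashed = WIFSIGNALED(status) ||
                   (WIFEXITED(status) && WEXITSTATUS(status) != EXIT_SUCCESS);
    if (crashed)
        release_remote_resources();      // controlled cleanup from the intact parent
    return !crashed;
}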
One approach would be for your program to have two modes: normal operation and monitoring.
When started in a usual way, it would :
Act as a background monitor.
Launch a subprocess of itself, passing it an internal argument (something that wouldn't clash with normal arguments passed to it, if any).
When the subprocess exits, it would release any resources held at the server.
When started with the internal argument, it would:
Expose the user interface and "act normally", using the resources of the server.
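A rough sketch of that two-mode pattern; the "--worker" argument, run_application() and release_server_resources() are invented names, not part of the original suggestion:

#include <cstring>
#include <sys/wait.h>
#include <unistd.h>

extern void run_application();            // hypothetical: normal operation
extern void release_server_resources();   // hypothetical: server-side cleanup

int main(int argc, char* argv[]) {
    if (argc > 1 && std::strcmp(argv[1], "--worker") == 0) {
        run_application();                // subprocess: act normally
        return 0;
    }

    // Monitor mode: relaunch ourselves as the worker and watch it.
    pid_t pid = fork();
    if (pid < 0)
        return 1;                         // fork failed
    if (pid == 0) {
        execl(argv[0], argv[0], "--worker", (char*)nullptr);
        _exit(127);                       // exec failed
    }
    int status = 0;
    waitpid(pid, &status, 0);             // returns when the worker exits or crashes
    release_server_resources();           // release whatever it held, either way
    return 0;
}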
You might look into atexit, which may give you the functionality you need to release resources upon program termination. I don't believe it is infallible, though.
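For example (release_remote_resources() is a hypothetical cleanup function; atexit() handlers run on normal termination via exit() or a return from main, but not after a crash, abort(), or _exit()):

#include <cstdlib>

extern void release_remote_resources();   // hypothetical

int main() {
    std::atexit(release_remote_resources);
    // ... normal work; the handler runs on normal termination only ...
    return 0;
}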
Having said that, however, you should really be focusing on making sure your program doesn't crash; if you're hitting an error that is "unrecoverable", you should still invest in some error-handling code. If the error is caused by a segmentation fault or some other similar OS-level error, you can either enable SEH exceptions (a Windows-specific mechanism) so you can catch them with a normal try-catch block, or write some signal handlers to intercept those errors and deal with them.

c++ watchdog for 3rd party lib calls

I have a problem with a long-running boost::regex_match(...) invocation in a threaded process environment, but it could be any other library (API call) with the same problem.
Is there a generic way to set up a watchdog for such?
For a non-threaded process, alarm() can be used to detect the timeout.
However, signals don't play nicely with threads. I can avoid direct use of alarm() in the thread and delegate timer management to a dedicated separate thread, letting that thread use pthread_kill(...) to address the correct threads (this is just an idea; I haven't verified that part yet).
However, this still only interrupts and detects the situation; it cannot gracefully stop boost::regex_match(...).
I played around with throwing an exception from within a signal handler, using sigsetjmp() and siglongjmp() in the thread calling boost::regex_match(...).
But that causes memory leaks in boost::regex_match(...), because siglongjmp() bypasses destructors.
How can I gracefully stop a 3rd-party API call, presuming that it's implemented exception-safely?
Or does it have to be supported by some "stoppable" feature actively implemented in the 3rd-party API? (Is there such a feature in the Boost library?)
Maybe some strange idea, but:
Code can be implemented to be "thread-safe" and/or "exception-safe".
Would it be an option to define "longjmp-safe"? This could be done by passing an additional token to a library so it can associate all of its resource allocations with that token. After longjmp(), the client software could then ask the API separately to release those resources.
Simpler still would be a central init()/release() or register()/unregister() API call through which the library could clean up after itself.
In a case where you have to:
monitor exceeding execution time
stop execution of processing
you should simply think in terms of tasks (separate processes) instead of threads.
Using threads is something that sounds like "state of the art", but in practice tasks are very often the better implementation, especially for containing memory leaks when execution ends in an undefined way, confining unwanted memory growth, controlling stack overruns, and so on.
In the case you mention, I tend to implement this with tasks. IPC works well on all common platforms, even though the mechanisms themselves are not portable between them. If portability is not a problem, changing to a task-based solution is not a big deal.
A hanging task can be killed with an OS call, and all locks, memory and other resources such as IPC objects, shared memory and pipes will be removed automatically. So this fits your problem much better, and it does not depend on your external, and maybe unchangeable, third-party components.
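As a hedged sketch of that task-based approach, the long-running call can be run in a child process and stopped from outside if it exceeds a deadline; the OS then reclaims everything the child held:

#include <sys/wait.h>
#include <unistd.h>
#include <string>
#include <boost/regex.hpp>

// Returns true only if the regex matched within the deadline.
bool match_with_watchdog(const std::string& text,
                         const boost::regex& re,
                         unsigned timeout_seconds) {
    pid_t pid = fork();
    if (pid < 0)
        return false;                            // fork failed
    if (pid == 0) {                              // child: run the possibly hanging call
        alarm(timeout_seconds);                  // SIGALRM terminates the child on timeout
        bool matched = boost::regex_match(text, re);
        _exit(matched ? 0 : 1);
    }
    int status = 0;
    waitpid(pid, &status, 0);                    // parent: wait for result or timeout
    if (WIFSIGNALED(status))
        return false;                            // timed out (or crashed): treat as no match
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}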

Safely cleaning up a blocking std::thread after an automated test

Consider a test case for some Mutex class implementation. The test creates several std::thread instances during execution. All threads should finish if the Mutex class is implemented correctly according to the test. If there is a problem, it's possible that one thread may block indefinitely. How can the test correctly cleanup after itself?
At first I thought to detach the thread, but then the thread is leaked. Even worse, the thread relies on a Mutex instance from inside the test case which will sporadically cause access violations after the test case returns.
Some thread libraries, such as Qt's QThread, have terminate() methods, but I'd like to use std::thread even though Qt is already a dependency for my project.
Is there a general pattern for testing potentially indefinitely blocking concurrent code?
Killing threads that may hold a lock is one of the main reasons terminating threads forcibly is frowned upon, and why C++11 doesn't support it. You are not supposed to do it, period.
If you need to do something like it, your best bet would probably be to spawn a new process to run the test in; if it locks up, you can terminate the process without the same risks.
For examples of why terminating threads is bad news, take a look at the specific example from the Old New Thing on what sort of garbage thread termination leaves lying around on Windows; similar issues occur on most operating systems under different contexts.
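A hedged sketch of that process-based approach (run_mutex_test() is a placeholder for the test body that spawns and joins the threads under test):

#include <sys/wait.h>
#include <unistd.h>
#include <signal.h>
#include <chrono>
#include <thread>

extern void run_mutex_test();            // hypothetical: creates and joins std::threads

bool run_test_with_timeout(unsigned timeout_seconds) {
    pid_t pid = fork();
    if (pid < 0)
        return false;                    // fork failed
    if (pid == 0) {                      // child: run the possibly deadlocking test
        run_mutex_test();
        _exit(0);                        // reached only if every thread finished
    }

    for (unsigned i = 0; i < timeout_seconds * 10; ++i) {
        int status = 0;
        if (waitpid(pid, &status, WNOHANG) == pid)              // child finished
            return WIFEXITED(status) && WEXITSTATUS(status) == 0;
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    kill(pid, SIGKILL);                  // the test hung: killing a process is safe
    waitpid(pid, nullptr, 0);            // reap it
    return false;                        // report failure
}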
I think destructors could help here; by design, they are the only thing that is 100% sure to be executed after any problem. I would suggest putting the blocking test inside some destructor and releasing resources in a safe way (smart pointers?) before leaving it.