I'm working on a library where I'm farming various tasks out to some third-party libraries that do some relatively sketchy or dangerous platform-specific work. (Specifically, I'm writing a mathematical function parser that calls JIT compilers, like LLVM or libjit, to build machine code.) In practice, these third-party libraries have a tendency to be crashy (part of this is my fault, of course, but I still want some insurance).
I'd like, then, to be able to very gracefully deal with a job dying horribly -- SIGSEGV, SIGILL, etc. -- without bringing down the rest of my code (or the code of the users calling my library functions). To be clear, I don't care if that particular job can continue (I'm not going to try to repair a crash condition), nor do I really care about the state of the objects after such a crash (I'll discard them immediately if there's a crash). I just want to be able to detect that a crash has occurred, stop the crash from taking out the entire process, stop calling whatever's crashing, and resume execution.
(For a little more context, the code at the moment is a for loop, testing each of the available JIT-compilers. Some of these compilers might crash. If they do, I just want to execute continue; and get on with testing another compiler.)
Currently, I've got a signal()-based implementation that fails pretty horribly; of course, it's undefined behavior to longjmp() out of a signal handler, and signal handlers are pretty much expected to end with exit() or terminate(). Just throwing the code in another thread doesn't help by itself, at least the way I've tested it so far. I also can't hack out a way to make this work using C++ exceptions.
So, what's the best way to insulate a particular set of instructions / thread / job from crashes?
Spawn a new process.
What output do you collect when a job succeeds?
I ask because if the output is low bandwidth I would be tempted to run each job in its own process.
Each of these crashy jobs you fire up has a high chance of corrupting memory used elsewhere in your process.
Processes offer the best protection.
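A rough POSIX sketch of that idea, assuming each job's result can be reduced to an exit status (run_job_isolated and crashy_job are made-up names): fork, run the job in the child, and have the parent check with waitpid whether it exited cleanly or was killed by a signal.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

typedef bool (*job_fn)(void);

// Run one job in a child process so a crash cannot take down the parent.
bool run_job_isolated(job_fn job)
{
    pid_t pid = fork();
    if (pid < 0)
        return false;                 // fork failed
    if (pid == 0)
        _exit(job() ? 0 : 1);         // child: run the possibly-crashy job

    int status = 0;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status)) {        // SIGSEGV, SIGILL, ... killed the child
        fprintf(stderr, "job died with signal %d\n", WTERMSIG(status));
        return false;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Stand-in for "one of the JIT compilers crashes".
static bool crashy_job(void)
{
    volatile int* p = nullptr;
    return *p == 0;
}

int main()
{
    job_fn jobs[] = { crashy_job /* , llvm_job, libjit_job, ... */ };
    for (job_fn j : jobs) {
        if (!run_job_isolated(j))
            continue;                 // the asker's "execute continue;"
    }
    return 0;
}

If a successful job needs to return more than an exit status, a pipe or a shared memory segment can carry the result back.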
Processes offer the best protection, but it's possible you can't do that.
If your threads' entry points are functions you wrote (for example, ThreadProc in the Windows world), then you can wrap them in try{...}catch(...) blocks. If you want to communicate that an exception has occurred, then you can communicate specific error codes back to the main thread or use some other mechanism. If you want to log not only that an exception has occurred but what that exception was, then you'll need to catch specific exception types and extract diagnostic information from them to communicate back to the main thread. À la:
#include <exception>
#include <string>

// Hypothetical reporting helper; assumed to exist elsewhere.
void tell_main_thread_what_went_wrong(const std::string& reason);

int my_temperamental_thread()
{
    try
    {
        // ... magic happens ...
        return 0;
    }
    catch (const std::exception& ex)
    {
        // ... or maybe it doesn't ...
        std::string reason = ex.what();
        tell_main_thread_what_went_wrong(reason);
        return 1;
    }
    catch (...)
    {
        // ... definitely not magical happenings here ...
        tell_main_thread_what_went_wrong("uh, something bad and undefined");
        return 2;
    }
}
Be aware that if you go this way you run the risk of hosing the host process when the exceptions do occur. You say you're not trying to correct the problem, but how do you know the malignant thread didn't eat your stack for example? Catch-and-ignore is a great way to create horribly confounding bugs.
On Windows, you might be able to use VirtualProtect(YourMemory, PAGE_READONLY) when calling the untrusted code. Any attempt to modify this memory would raise a structured exception, which you can safely catch before continuing execution. However, memory allocated by that library will of course leak, as will other resources. The Linux equivalent is mprotect(YourMemory, PROT_READ), which raises SIGSEGV instead.
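A minimal POSIX sketch of the mprotect side of this (mprotect wants page-aligned memory, hence the mmap; the helper name is made up, and len is assumed to fit in one page):

#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Copy data into a fresh page and make it read-only.
void* make_readonly_page(const void* data, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void* mem = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return nullptr;
    memcpy(mem, data, len);
    mprotect(mem, page, PROT_READ);   // any write now raises SIGSEGV
    return mem;
}

int main()
{
    const char msg[] = "read-only data";
    const char* p = (const char*)make_readonly_page(msg, sizeof msg);
    printf("%s\n", p);   // reading is fine; writing through p would fault
    return 0;
}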
There are a few SO posts about whether or not declaring main() using function-try-block syntax is valid syntax, and the general consensus seems to be that it's perfectly valid. This left me wondering... is there any reason (performance, style, thread synchronization, multithreading) why one wouldn't use this syntax for main() as a general rule to catch any unhandled exceptions anywhere more gracefully?
Obviously, ideally there won't be unhandled exceptions, but they happen and I think it'd be nice to provide something more informative than the OS-specific default handler. For example, in my case, I'd like to provide a support email address to the user so they can report the crash and have my program submit a log to my cloud-based crash log.
For example, in my case, I'd like to provide a support email address to the user
Well, how are you going to do that in a server with no user-facing interface?
Actually, how are you going to do that even in a process with user-facing components, if you have no way to tell in the catch block what state they're in?
And, for those processes where you can't show the user anything useful (or don't have any concept of a "user" in the first place), what would you do in your catch block that would be better than the default terminate?
As for
... more informative than the OS-specific default handler ...
many operating systems' default behaviour is to save a complete snapshot of the process execution state, at the point the unhandled exception is thrown, to a file for debugging. As the developer, I can't think of many default behaviours that would be more informative.
Admittedly I'd prefer something more polished as the end user of a desktop app, but that's a pretty small subset of C++ programs.
You can easily convert

int main() try {
    // The real code of main
}
catch (...)
{
}

to

int realMain()
{
    // The real code of main
}

int main()
{
    try
    {
        return realMain();
    }
    catch (...)
    {
    }
}

without losing functionality/behavior.
I am going to guess that whether you use the first version or the second is a matter of a team's coding practices. From a compiler and runtime standpoint, I don't see any semantic difference.
If you happen to have a variable that you want to access in your catch block, you need the second form: declare the variable before the try block so it is still in scope in the handler (with a function-try-block, main's locals are out of scope in the catch). But even that could be handled with nested try/catch...
why one wouldn't use this syntax for main() as a general rule to catch
any unhandled exceptions anywhere more gracefully?
Compatibility with C.
Sometimes there is no way to handle unhandled exceptions more gracefully.
Obviously, ideally there won't be unhandled exceptions, but they
happen and I think it'd be nice to provide something more informative
than the OS-specific default handler. For example, in my case, I'd
like to provide a support email address to the user so they can report
the crash and have my program submit a log to my cloud-based crash
log.
If an unexpected exception happens, you cannot be sure that it is possible to handle it correctly. What are you going to do if a network-error exception occurs in your example, and trying to send the e-mail raises another exception? There can be other errors where you cannot be sure that your data is not corrupted, and you cannot be sure that your program can run correctly after the error. So if you don't know what error happened, it is better to allow your program to crash.
You can implement another "watcher" service that checks whether the process is running; if it has crashed, the watcher can send an e-mail to your users with the logs and core dumps.
If you catch the (otherwise) uncaught object, you won't be able to figure out how execution reached the throw by inspecting the stack trace, because by the time the exception handler is executed, the stack has already been unwound.
If you let the unexpected exception go uncaught, you may be able to inspect the stack trace in the terminate handler. This is not guaranteed by the standard, but that's not a big deal, since there is no standard way to inspect the stack trace either (in C++). You can use either a platform-specific API within the program or an external debugger for the inspection.
So for example in your case, the advantage of not catching the exception would be that you can attach a stack trace to the log entry that you intend to submit.
Also, there are cases where an exception cannot be handled by a catch block. For example, when you throw from a destructor that is being executed as a result of throwing an exception. So, to handle these "uncatchable" exceptions, you need a terminate handler anyway, and there is little advantage in duplicating its functionality for uncaught exceptions.
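For instance, a minimal sketch of installing such a terminate handler (submit_crash_log is a hypothetical stand-in for the asker's cloud logger; the stack-walking itself would be platform-specific):

#include <cstdio>
#include <cstdlib>
#include <exception>

// Hypothetical logger; stands in for whatever submits to your crash log.
void submit_crash_log(const char* what)
{
    std::fprintf(stderr, "%s\n", what);
}

[[noreturn]] void on_terminate()
{
    // The stack may still be intact here for a debugger or platform API to walk.
    submit_crash_log("terminate() called on an uncaught exception");
    std::abort();   // still let the OS produce its core dump / crash report
}

int main()
{
    std::set_terminate(on_terminate);
    throw 42;       // uncaught: on_terminate runs instead of a catch block
}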
As for the syntax that you use to catch the exception, there is no difference. The one case where the function-try-block is different is a constructor, where it allows catching exceptions thrown by subobject constructors.
I have a third-party function which I use in my program. I can't replace it; it's in a dynamic library, so I also can't edit it. The problem is that it sometimes runs for too long.
So, can I do anything to stop this function from running if it runs for more than, say, 10 seconds? (It's OK to close the program in this scenario.)
PS. I have Linux, and this program won't have to be ported anywhere else.
What I want is something like this:
#include <stdio.h>
#include <stdlib.h>

void func1(void) // I can not change contents of this.
{
    int i; // random
    while (i % 2 == 0);
}

int main()
{
    setTryTime(10000);
    timeTry {
        func1();
    } catchTime {
        puts("function executed too long, aborting..");
    }
    return 0;
}
Sure. And you'd do it just the way you suggested in your title: "signals".
Specifically, an "alarm" signal:
http://linux.die.net/man/2/alarm
http://beej.us/guide/bgipc/output/html/multipage/signals.html
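A minimal sketch of the alarm approach (func1 here is a stand-in for the real library function; sigsetjmp/siglongjmp are used because, unlike plain longjmp, they also restore the signal mask; and jumping out of the handler abandons whatever func1 was doing, so treat this as a last resort):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf env;

static void on_alarm(int)
{
    siglongjmp(env, 1);   // jump back to main; also restores the signal mask
}

// Stand-in for the real library function that may spin forever.
void func1(void)
{
    for (;;) {}
}

int main()
{
    signal(SIGALRM, on_alarm);
    if (sigsetjmp(env, 1) == 0) {
        alarm(10);        // deliver SIGALRM after 10 seconds
        func1();
        alarm(0);         // func1 finished in time; cancel the alarm
    } else {
        puts("function executed too long, aborting..");
    }
    return 0;
}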
If you really have to do this, you probably want to spawn a process that does nothing but invoke the function and return its result to the caller. If it runs too long, you can kill that process.
By putting it into its own process, you stand a decent (not great, but decent) chance of cleaning up at least most of what it was doing, so that when it dies unexpectedly it probably won't make a complete mess of things that will lead to later problems.
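A rough POSIX sketch of that, assuming the function's result fits in an exit code (run_with_timeout is a made-up name, and func1 is a stand-in for the real library function): poll the child with waitpid(WNOHANG) and kill it past the deadline.

#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>
#include <cstdio>

// Stand-in for the real library function that may run forever.
void func1(void)
{
    for (;;) {}
}

// Returns true if func1 finished within timeout_s seconds.
bool run_with_timeout(unsigned timeout_s)
{
    pid_t pid = fork();
    if (pid < 0)
        return false;
    if (pid == 0) {
        func1();
        _exit(0);
    }
    for (unsigned i = 0; i < timeout_s * 10; ++i) {
        if (waitpid(pid, nullptr, WNOHANG) == pid)
            return true;              // child finished in time
        usleep(100 * 1000);           // poll every 100 ms
    }
    kill(pid, SIGKILL);               // deadline passed; stop it hard
    waitpid(pid, nullptr, 0);         // reap the child
    return false;
}

int main()
{
    if (!run_with_timeout(10))
        puts("function executed too long, aborted");
    return 0;
}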
The potential problem with forcefully cancelling a running function is that it may "own" resources that it intended to return later. The kind of resources that can be problems include:
heap memory allocations (free store)
shared memory segments
threads
sockets
file handles
locks
Some of these resources are managed on a per-process basis, so letting the function run in a different process (perhaps using fork) makes it easier to kill cleanly. Other resources can outlive a process, and really must be cleaned up explicitly. Depending on your operating system, it's also possible that the function may be part-way through interacting with some hardware driver or device, and killing it unexpectedly may leave that driver or device in a bizarre state such that it won't work until after a restart.
If you happen to know that the function doesn't use any of these kind of resources, then you can kill it confidently. But, it's hard to guarantee that: in a large system with many such decisions - which the compiler can't check - evolution of code in functions like func1() is likely to introduce dependencies on such resources.
If you must do this, I'd suggest running it in a different process or thread, and using kill() for processes, pthread_kill() if func1() has some support for terminating when a flag is set asynchronously, or the non-portable pthread_cancel() if there's really no other choice.
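For what it's worth, a minimal pthread_cancel sketch (assumptions: deferred cancellation, and that func1 eventually hits a cancellation point such as pause() or read(); pure computation never gets cancelled in this mode, and func1 here is a stand-in):

#include <pthread.h>
#include <unistd.h>
#include <cstdio>

// Stand-in: pause() is a cancellation point, so deferred cancellation works here.
void func1(void)
{
    for (;;)
        pause();
}

static void* worker(void*)
{
    func1();
    return nullptr;
}

int main()   // build with -pthread
{
    pthread_t tid;
    pthread_create(&tid, nullptr, worker, nullptr);

    sleep(10);                        // give func1 ten seconds
    pthread_cancel(tid);              // request cancellation

    void* res = nullptr;
    pthread_join(tid, &res);
    if (res == PTHREAD_CANCELED)
        puts("function executed too long, cancelled");
    return 0;
}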
I am working on a multithreaded process written in C++, and am considering modifying SIGSEGV handling using google-coredumper to keep the process alive when a segmentation fault occurs.
However, this use of google-coredumper seems rife with opportunities to get stuck in an infinite loop of core dumps unless I somehow reinitialize the thread and the object that may have caused the core dump.
What best practices should I keep in mind when trying to keep a process alive through a core dump? What other 'gotchas' should I be aware of?
Thanks!
It is actually possible in C. You can achieve it in a rather complicated way:
1) Override the signal handler.
2) Use setjmp() and longjmp() to set the place to jump back to, and to actually jump back there.
Check out this code I wrote (idea taken from "Expert C Programming: Deep C Secrets" by Peter Van Der Linden):
#include <signal.h>
#include <stdio.h>
#include <setjmp.h>

// Global jmp_buf shared by main and the signal handler.
jmp_buf buf;

void magic_handler(int s)
{
    switch (s)
    {
    case SIGSEGV:
        // Note: printf is not async-signal-safe; fine for a demo, not production.
        printf("\nSegmentation fault signal caught! Attempting recovery..");
        longjmp(buf, 1);
        break;
    }
    printf("\nAfter switch. Won't be reached");
}

int main(void)
{
    int *p = NULL;

    signal(SIGSEGV, magic_handler);

    if (!setjmp(buf))
    {
        // Dereferencing a null pointer causes a segmentation fault,
        // which is handled by our magic_handler now.
        *p = 0xdead;
    }
    else
    {
        printf("\nSuccessfully recovered! Welcome back in main!!\n\n");
    }

    return 0;
}
The best practice is to fix the original issue causing the core dump, recompile and then relaunch the application.
To catch these errors before deploying in the wild, do plenty of peer review and write lots of tests.
Steve's answer is actually a very useful formula. I've used something similar in a piece of complicated embedded software where there was at least one SIGSEGV error in the code that we could not track down by ship time. As long as you can reset your code to have no ill effects (memory or resource leaks) and the error is not something that causes an endless loop, it can be a lifesaver (even though it's better to fix the bug). FYI, in our case it was single-threaded.
But what is left out is that once you recover from your signal handler, it will not work again unless you unmask the signal. Here is a chunk of code to do that:
sigset_t signal_set;
...
setjmp(buf);

// SIGSEGV was blocked on entry to the handler we longjmp'd out of,
// so unblock it or the handler will never fire again.
sigemptyset(&signal_set);
sigaddset(&signal_set, SIGSEGV);
sigprocmask(SIG_UNBLOCK, &signal_set, NULL);

// Initialize all variables...
Be sure to free up your memory, sockets and other resources or you could leak memory when this happens.
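As an aside, sigsetjmp/siglongjmp can do this unmasking for you: if you save the mask with sigsetjmp(buf, 1), siglongjmp restores it automatically. A minimal sketch:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf buf;

static void handler(int)
{
    siglongjmp(buf, 1);               // also restores the saved signal mask
}

int main()
{
    signal(SIGSEGV, handler);
    if (sigsetjmp(buf, 1) == 0) {     // the 1 means: save the signal mask
        volatile int* p = NULL;
        *p = 1;                       // fault; handler jumps back out
    } else {
        puts("recovered; SIGSEGV is already unmasked");
    }
    return 0;
}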
My experience with segmentation faults is that it's very hard to catch them portably, and to do it portably in a multithreaded context is next to impossible.
This is for good reason: Do you really expect the memory (which your threads share) to be intact after a SIGSEGV? After all, you've just proven that some addressing is broken, so the assumption that the rest of the memory space is clean is pretty optimistic.
Think about a different concurrency model, e.g. one based on processes. Processes don't share their memory, or share only a well-defined part of it (shared memory), and one process can reasonably carry on when another process has died. When you have a critical part of the program (e.g. the core temperature control), putting it in an extra process protects it from memory corruption by other processes and from their segmentation faults.
If a segmentation fault occurs, you're better off just ditching the process. How can you know that any of your process's memory is usable after this? If something in your program is messing with memory it shouldn't, why do you believe it didn't mess with some other part of memory that your process actually can access without segfaulting?
I think that doing this will mostly benefit attackers.
From the description of coredumper, it seems its purpose is not what you're intending; it just allows making snapshots of process memory.
Personally, I wouldn't keep a process alive after it triggered a core dump -- there are just so many ways it could be broken -- and would instead employ some persistence to allow data recovery after the process is restarted.
And, yes, as parapura has suggested, better yet, find out what's causing the SIGSEGV and fix it.
Is there some way to run code on termination, no matter what kind of termination (abnormal, normal, uncaught exception, etc.)?
I know it's actually possible in Java, but is it even possible in C++? I'm assuming a Windows environment.
No -- if somebody invokes TerminateProcess, your process will be destroyed without further ado, and (in particular) without any chance to run any more code in the process of shutting down.
For a normally closing application, I would suggest atexit().
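A minimal sketch of that (note that atexit handlers run on normal return from main and on exit(), but not on abnormal termination):

#include <cstdio>
#include <cstdlib>

void cleanup()
{
    std::puts("cleanup code runs here");
}

int main()
{
    std::atexit(cleanup);
    // ... program ...
    return 0;   // cleanup() runs now, and also on any exit() call
}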
One good way to approach the problem is using the C++ RAII idiom, which here means that cleanup operations can be placed in the destructor of an object, i.e.
class ShutdownHook {
public:                    // the destructor must be public for a local object
    ~ShutdownHook() {
        // exit handler code
    }
};

int main() {
    ShutdownHook h;
    //...
}
See the Object Lifetime Manager in ACE library. At the linked document, they discuss about the atexit function as well.
Not for any kind of termination; there are signals that are designed not to be handled, like SIGKILL on Linux.
These signals are designed to terminate a program that has consumed all memory, or CPU, or some other resources, and has left the computer in a state that makes it difficult to run a handler function.
I have a Windows C++ console program, and if I don't call ReleaseDriver() at the end of my program, some pieces of hardware enter a bad state and can't be used again without rebooting.
I'd like to make sure ReleaseDriver() gets run even if the program exits abnormally, for example if I hit Ctrl+C or close the console window.
I can use signal() to create a signal handler for SIGINT. This works fine, although as the program ends it pops up an annoying error "An unhandled Win32 exception occurred...".
I don't know how to handle the case of the console window being closed, and (more importantly) I don't know how to handle exceptions caused by bad memory accesses etc.
Thanks for any help!
Under Windows, you can create an unhandled exception filter by calling SetUnhandledExceptionFilter(). Once done, any time an exception is generated that is not handled somewhere in your application, your handler will be called.
Your handler can be used to release resources, generate dump files (see MiniDumpWriteDump), or whatever you need to make sure gets done.
Note that there are many 'gotchas' surrounding how you write your exception handler function. In particular:
You cannot call any CRT function, such as new
You cannot perform any stack-based allocation
If you do anything in your handler which causes an exception, Windows will immediately terminate your application by ripping the bones out of its back. You get no further chances to shut down gracefully.
You can call many Windows API functions. But you can't sprintf, new, delete... In short, if it isn't a WINAPI function, it probably isn't safe.
Because of all of the above, it is advisable to make all the variables in your handler function static variables. You won't be able to use sprintf, so you will have to format strings ahead of time, during initialization. Just remember that the machine is in a very unstable state when your handler is called.
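A minimal sketch wiring this up to the ReleaseDriver() call from the question (the filter signature is fixed by the API; the ReleaseDriver body here is a stand-in for the real hardware cleanup):

#include <windows.h>

// Stand-in for the real routine that puts the hardware back in a good state.
void ReleaseDriver()
{
}

LONG WINAPI CrashFilter(EXCEPTION_POINTERS* /*info*/)
{
    ReleaseDriver();                  // last-chance cleanup
    return EXCEPTION_EXECUTE_HANDLER; // then let the process terminate
}

int main()
{
    SetUnhandledExceptionFilter(CrashFilter);
    // ... program ...
    return 0;
}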
If I'm not mistaken, you can detect if the console is closed or the program is terminated with Ctrl+C with SetConsoleCtrlHandler:
#include <windows.h>

// WINAPI matches PHANDLER_ROUTINE's calling convention, so no cast is needed.
BOOL WINAPI CtrlHandler(DWORD)
{
    // MessageBoxA takes narrow strings regardless of the UNICODE setting.
    MessageBoxA(NULL, "Program closed", "Message", MB_ICONEXCLAMATION | MB_OK);
    exit(0);
}

int main()
{
    SetConsoleCtrlHandler(CtrlHandler, TRUE);
    while (true);
}
If you are worried about exceptions, like bad_alloc, you can wrap main in a try block. Catch std::exception&, which should ideally be the base class of all thrown exceptions, but you can also catch any C++ exception with catch (...). With those exceptions, though, not all is lost, and you should figure out what is being thrown and why.
Avoiding undefined behavior also helps. :)
You can't (guarantee code runs). You could lose power, then nothing will run. The L1 instruction cache of your CPU could get fried, then your code will fail in random ways.
The most sure way of running cleanup code is in a separate process that watches for exit of the first (just WaitForSingleObject on the process handle). A separate watchdog process is as close as you can get to a guarantee (but someone could still TerminateProcess your watchdog).
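A sketch of the watchdog side, assuming the watchdog launches the watched program itself ("child.exe" and the cleanup step are placeholders):

#include <windows.h>
#include <cstdio>

int main()
{
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};

    // Launch the process being watched.
    if (!CreateProcessA("child.exe", NULL, NULL, NULL, FALSE, 0,
                        NULL, NULL, &si, &pi))
        return 1;

    // Returns when the watched process exits, however it exits.
    WaitForSingleObject(pi.hProcess, INFINITE);

    DWORD code = 0;
    GetExitCodeProcess(pi.hProcess, &code);
    std::printf("watched process exited with code %lu; running cleanup\n",
                (unsigned long)code);
    // ... cleanup code goes here ...

    CloseHandle(pi.hProcess);
    CloseHandle(pi.hThread);
    return 0;
}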