Best practices for recovering from a segmentation fault - c++

I am working on a multithreaded process written in C++, and am considering modifying SIGSEGV handling using google-coredumper to keep the process alive when a segmentation fault occurs.
However, this use of google-coredumper seems rife with opportunities to get stuck in an infinite loop of core dumps unless I somehow reinitialize the thread and the object that may have caused the core dump.
What best practices should I keep in mind when trying to keep a process alive through a core dump? What other 'gotchas' should I be aware of?
Thanks!

It is actually possible in C, though in quite a convoluted way:
1) Override the signal handler
2) Use setjmp() and longjmp() to mark the place to jump back to, and to actually jump back there.
Check out this code I wrote (the idea is taken from "Expert C Programming: Deep C Secrets" by Peter van der Linden):
#include <signal.h>
#include <stdio.h>
#include <setjmp.h>

// Global jmp_buf shared by main() and the signal handler
jmp_buf buf;

void magic_handler(int s)
{
    switch (s)
    {
        case SIGSEGV:
            printf("\nSegmentation fault signal caught! Attempting recovery..");
            longjmp(buf, 1);
            break;
    }
    printf("\nAfter switch. Won't be reached");
}

int main(void)
{
    // volatile so the compiler cannot optimize the faulting store away
    volatile int *p = NULL;

    signal(SIGSEGV, magic_handler);

    if (!setjmp(buf))
    {
        // Dereferencing a null pointer causes a segmentation fault,
        // which is now handled by magic_handler.
        *p = 0xdead;
    }
    else
    {
        printf("\nSuccessfully recovered! Welcome back in main!!\n\n");
    }

    return 0;
}

The best practice is to fix the original issue causing the core dump, recompile and then relaunch the application.
To catch these errors before deploying in the wild, do plenty of peer review and write lots of tests.

Steve's answer is actually a very useful formula. I've used something similar in a piece of complicated embedded software where there was at least one SIGSEGV error in the code that we could not track down by ship time. As long as you can reset your code to have no ill effects (memory or resource leaks) and the error is not something that causes an endless loop, it can be a lifesaver (even though it's better to fix the bug). FYI, in our case it was a single thread.
But what is left out is that once you recover from your signal handler, it will not work again unless you unmask the signal. Here is a chunk of code to do that:
sigset_t signal_set;
...
setjmp(buf);

// After longjmp()ing out of the handler, SIGSEGV is still blocked;
// unblock it or the handler will never run again.
sigemptyset(&signal_set);
sigaddset(&signal_set, SIGSEGV);
sigprocmask(SIG_UNBLOCK, &signal_set, NULL);

// Reinitialize all variables...
Be sure to free your memory, sockets, and other resources, or you will leak them every time this happens.
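If your platform has the POSIX variants, sigsetjmp()/siglongjmp() do this bookkeeping for you: a nonzero second argument to sigsetjmp() saves the current signal mask, and siglongjmp() restores it, so SIGSEGV is automatically unblocked after the jump. A minimal sketch of that variant (same idea as the code above; it does not make recovery any safer, it only fixes the re-arming):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf env;

static void handler(int s)
{
    (void)s;
    siglongjmp(env, 1);   /* restores the signal mask saved by sigsetjmp() */
}

int main(void)
{
    signal(SIGSEGV, handler);
    if (sigsetjmp(env, 1) == 0)   /* nonzero 2nd arg: save the signal mask */
    {
        volatile int *p = NULL;
        *p = 0xdead;              /* fault; the handler jumps back here */
    }
    else
    {
        printf("Recovered, and SIGSEGV is already unblocked\n");
    }
    return 0;
}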

My experience with segmentation faults is that it's very hard to catch them portably, and to do it portably in a multithreaded context is next to impossible.
This is for good reason: Do you really expect the memory (which your threads share) to be intact after a SIGSEGV? After all, you've just proven that some addressing is broken, so the assumption that the rest of the memory space is clean is pretty optimistic.
Think about a different concurrency model, e.g. one based on processes. Processes don't share their memory, or share only a well-defined part of it (shared memory), and one process can reasonably carry on when another process dies. When you have a critical part of the program (e.g. core temperature control), putting it in a separate process protects it from memory corruption and segmentation faults in the other processes.
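As a minimal sketch of that model on a POSIX system (the "worker" logic is made up for illustration): the parent fork()s the risky work into a child, and when the child dies of SIGSEGV only the child's address space is lost; the parent sees it via waitpid() and can restart it.

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
    {
        /* child: the risky work lives in its own address space */
        volatile int *p = NULL;
        *p = 42;                  /* crashes only this process */
        _exit(0);
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
        printf("worker segfaulted; parent is intact and can restart it\n");
    return 0;
}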

If a segmentation fault occurs, you're better off just ditching the process. How can you know that any of your process's memory is usable after this? If something in your program is messing with memory it shouldn't, why do you believe it didn't mess with some other part of memory that your process actually can access without segfaulting?
I think that doing this will mostly benefit attackers.

From the description of coredumper, its purpose is not what you intend; it just allows you to take snapshots of process memory.
Personally, I wouldn't keep a process alive after it has triggered a core dump -- there are just so many ways it could be broken -- and would instead employ some persistence mechanism so data can be recovered after the process is restarted.
And yes, as parapura has suggested, better yet: find out what is causing the SIGSEGV and fix it.

Related

Break loop when SIGSEGV received C/C++

I have to use C++ classes which are not properly written -- there is no indication whether a function inside a loop executed properly or not.
If it did not, I receive a segmentation fault and lose everything that was calculated so far. I would like to convert the SIGSEGV signal into breaking out of the loop. Is there any possibility?
Using signal handlers from #include <csignal> doesn't help.
A segmentation fault may happen in one of two ways:
uncontrolled segfaults, where a process accesses addresses for which the access is not well defined.
@JSB: this is the case you're dealing with, and there's little you can do about it other than getting the offending code fixed.
When an uncontrolled segfault happens -- which is the case for buggy code in 99.999% of all cases -- the only reasonable thing to do is cut your losses (you may still write to already-opened files from a SIGSEGV handler) and terminate the process.
@JSB: the following does not apply to you! It is included just for completeness!
"controlled" segfaults, where a process accesses addresses which are allocated by the process, but for which read/write/execute access is disabled.
A controlled segfault may be induced in the following way:
size_t const sz_p = sysconf(_SC_PAGESIZE);   // one page; needs <unistd.h>
// Map a page with all access disabled; needs <sys/mman.h>
char *p = (char*)mmap(NULL, sz_p, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
strcpy(p, "sigsegv");   // writing to a PROT_NONE page raises SIGSEGV
So why is this a controlled segfault? Because you can actually react to it in a sensible way: in the SIGSEGV handler, you can change the memory protection of the page whose access caused the segfault so that the access is allowed:
// Install with sigaction() and SA_SIGINFO so the handler receives a siginfo_t;
// p and sz_p must be visible here (e.g. as globals).
void sigsegv_handler(int, siginfo_t *info, void *)
{
    char *addr = (char*)info->si_addr;
    if (addr >= p && (size_t)(addr - p) < sz_p)
    {
        mprotect(p, sz_p, PROT_READ | PROT_WRITE);
    }
}
It is important to understand that this kind of SIGSEGV handler is well behaved and well defined only if the segfault was caused by access to an actually allocated memory object, and if the handler's only action is to set memory protection flags on memory owned by the process. You can't use it to make broken code magically work!
So why would one actually do this? One example is the client-side implementation of APIs that allow network distribution and also allow mapping objects into memory, like OpenGL with its glMapBuffer / glUnmapBuffer API functions. To avoid unnecessary round trips and transfers, you'd want to transfer only those parts of the buffer actually read and/or modified. For this you have to somehow detect which pages a program touches. Some OSs (like Windows) have a dedicated API for this, but on *nixes you have to use mmap + mprotect + SIGSEGV handler tricks to implement it.
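Putting the pieces above together, a minimal Linux-flavoured sketch of such page-touch detection might look like this (the single tracked mapping and the touched flag are assumptions made for illustration):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *p;        /* the tracked mapping */
static size_t sz_p;     /* its size */
static int    touched;  /* set when the program first writes to it */

static void sigsegv_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char*)info->si_addr;
    if (addr >= p && (size_t)(addr - p) < sz_p)
    {
        touched = 1;
        mprotect(p, sz_p, PROT_READ | PROT_WRITE);  /* the access will retry */
    }
    else
    {
        _exit(1);   /* a real, uncontrolled segfault */
    }
}

int main(void)
{
    sz_p = (size_t)sysconf(_SC_PAGESIZE);
    p = (char*)mmap(NULL, sz_p, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = sigsegv_handler;
    sa.sa_flags = SA_SIGINFO;       /* deliver a siginfo_t to the handler */
    sigaction(SIGSEGV, &sa, NULL);

    strcpy(p, "sigsegv");           /* faults once; the handler unlocks the page */
    printf("touched=%d contents=%s\n", touched, p);
    return 0;
}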

How do I use the stack but avoid a stack overflow in C++

I'm presently moving back to C++ from Java. There are some areas of C++ where higher performance can be achieved by doing more computation on the stack. And some recursive algorithms operate more efficiently on the stack than on the heap.
Obviously the stack is a resource, and if I am going to use it, I should ensure that I do not consume too much (to the point of crashing my program).
I'm running Xcode, and wrote the following simple program:
#include <csignal>

// volatile sig_atomic_t is the only type the standard guarantees
// is safe to write from a signal handler
static volatile sig_atomic_t interrupted = 0;

long stack_test(long limit)
{
    if ((limit > 0) && !interrupted)
        return stack_test(limit - 1) + 1; // program crashes here with EXC_BAD_ACCESS...
    else
        return 0;
}

void signal_handler(int sig)
{
    interrupted = 1;
}

int main()
{
    signal(SIGSEGV, &signal_handler);
    stack_test(1000000);
    signal(SIGSEGV, SIG_DFL);
}
The documentation states that, running on BSD, stack limits can be checked using getrlimit() and that when the stack limit is reached, a SIGSEGV signal is issued. I tried installing the above signal handler for it, but instead my program stops at the next iteration with EXC_BAD_ACCESS (code=2, ...).
Am I taking the wrong approach here, or is there a better way?
This has the same problem in Java as it does in C++. You are way overcommitting the stack.
And some recursive algorithms operate more efficiently on the stack than on the heap.
Indeed, and they are commonly of the divide and conquer type.
The usefulness of recursion is to reduce the computation to a more manageable computation with each call. limit - 1 is not such a candidate.
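To illustrate, the stack_test() from the question reduces to a trivial loop that uses constant stack space (a sketch; the function name is made up):

#include <stdio.h>

long stack_test_iterative(long limit)
{
    long n = 0;
    while (limit-- > 0)
        ++n;            // same value the recursion computes, with O(1) stack
    return n;
}

int main()
{
    printf("%ld\n", stack_test_iterative(1000000));  // no overflow
}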
If your question is only about the signal, I unfortunately can't offer you any advice on your system.
Your signal handler can't do much to fix the stack overflow. Setting your interrupted flag doesn't help. When your signal handler returns, the instruction that tried to write to an address beyond the end of the stack resumes and it's still going to attempt to write beyond the end of the stack. Your code won't get back to the part which checks your interrupted flag.
With great care and a lot of architecture-specific code, your signal handler could potentially change the context of the thread which encountered the signal such that, when it resumes, it will be at a different point in the code.
You could also use setjmp() and longjmp() to accomplish this at a coarser granularity.
A different approach would be to set up a thread that uses a stack you allocated yourself, via pthread_attr_setstack() (or the older pthread_attr_setstackaddr()/pthread_attr_setstacksize()) prior to pthread_create(). You would run your code in that secondary thread rather than the main one. You could set the last page or two of the allocated stack non-writable using mprotect(). Then your signal handler could set the interrupted flag and also make those pages writable again. That should give you enough headroom for the resumed code to execute without re-raising the signal, get far enough to check the flag, and return gracefully. Note that this is a one-time last resort, unless you can find a good point at which to make those guard pages non-writable again. A rough sketch follows.
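This is a rough, Linux-flavoured sketch of that scheme; the names (worker, deep, on_segv), the sizes, and the single guard page are all assumptions, error checks are omitted, and calling mprotect() from a signal handler is not guaranteed async-signal-safe by POSIX -- treat it as a pragmatic trick, not a portable guarantee:

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)     /* 1 MiB thread stack (an assumption) */

static volatile sig_atomic_t interrupted = 0;
static char  *stack_base;            /* globals so the handler can see them */
static size_t page;

static void on_segv(int sig)
{
    (void)sig;
    interrupted = 1;
    /* Re-open the guard page so the faulting write can complete. */
    mprotect(stack_base, page, PROT_READ | PROT_WRITE);
}

static long deep(long n)             /* stand-in for the recursive work */
{
    if (interrupted || n <= 0)
        return 0;
    return deep(n - 1) + 1;
}

static void *worker(void *arg)
{
    (void)arg;
    /* The thread's own stack is full when the fault hits, so the handler
       needs an alternate stack; sigaltstack() is per-thread, so set it
       in the thread that may fault. */
    static char altstack[64 * 1024];
    stack_t ss;
    ss.ss_sp = altstack;
    ss.ss_size = sizeof altstack;
    ss.ss_flags = 0;
    sigaltstack(&ss, NULL);

    printf("recursed %ld levels before hitting the guard page\n",
           deep(100000000L));
    return NULL;
}

int main(void)
{
    page = (size_t)sysconf(_SC_PAGESIZE);
    stack_base = (char*)mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(stack_base, page, PROT_NONE);  /* guard page at the low end:
                                               stacks grow downward */
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sa.sa_flags = SA_ONSTACK;               /* run handler on the alt stack */
    sigaction(SIGSEGV, &sa, NULL);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, stack_base, STACK_SIZE);

    pthread_t t;
    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}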

Can I interrupt a function if it executes for too long?

I have a third-party function which I use in my program. I can't replace it; it's in a dynamic library, so I also can't edit it. The problem is that it sometimes runs for too long.
So, can I do anything to stop this function from running if it runs for more than 10 seconds, for example? (It's OK to close the program in this scenario.)
PS. I have Linux, and this program won't have to be ported anywhere else.
What I want is something like this:
#include <stdio.h>
#include <stdlib.h>

void func1(void) // I can not change the contents of this.
{
    int i; // random
    while (i % 2 == 0);
}

int main()
{
    setTryTime(10000);
    timeTry {
        func1();
    } catchTime {
        puts("function executed too long, aborting..");
    }
    return 0;
}
Sure. And you'd do it just the way you suggested in your title: "signals".
Specifically, an "alarm" signal:
http://linux.die.net/man/2/alarm
http://beej.us/guide/bgipc/output/html/multipage/signals.html
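A minimal sketch along those lines, assuming it is acceptable to simply abandon func1() mid-flight (whatever it holds leaks, with the same caveats about jumping out of a signal handler discussed earlier): arm alarm(), and siglongjmp() out of the SIGALRM handler. The func1() here is a busy-loop stand-in for the real third-party call.

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf too_long;

static void func1(void)           /* stand-in for the real third-party call */
{
    volatile unsigned long spin = 0;
    for (;;)
        ++spin;                   /* never returns on its own */
}

static void on_alarm(int s)
{
    (void)s;
    siglongjmp(too_long, 1);      /* abandon whatever func1() was doing */
}

int main(void)
{
    signal(SIGALRM, on_alarm);
    if (sigsetjmp(too_long, 1) == 0)
    {
        alarm(10);                /* SIGALRM in 10 seconds */
        func1();
        alarm(0);                 /* finished in time: cancel the alarm */
    }
    else
    {
        puts("function executed too long, aborting..");
    }
    return 0;
}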
If you really have to do this, you probably want to spawn a process that does nothing but invoke the function and return its result to the caller. If it runs too long, you can kill that process.
By putting it into its own process, you stand a decent (not great, but decent) chance of cleaning up at least most of what it was doing, so that when it dies unexpectedly it probably won't make a complete mess of things that would lead to later problems.
The potential problem with forcefully cancelling a running function is that it may "own" resources that it intended to return later. The kind of resources that can be problems include:
heap memory allocations (free store)
shared memory segments
threads
sockets
file handles
locks
Some of these resources are managed on a per-process basis, so letting the function run in a different process (perhaps using fork) makes it easier to kill cleanly. Other resources can outlive a process, and really must be cleaned up explicitly. Depending on your operating system, it's also possible that the function may be part-way through interacting with some hardware driver or device, and killing it unexpectedly may leave that driver or device in a bizarre state such that it won't work until after a restart.
If you happen to know that the function doesn't use any of these kind of resources, then you can kill it confidently. But, it's hard to guarantee that: in a large system with many such decisions - which the compiler can't check - evolution of code in functions like func1() is likely to introduce dependencies on such resources.
If you must do this, I'd suggest running it in a different process or thread, and using kill() for processes, pthread_kill() if func1() has some support for terminating when a flag is set asynchronously, or the non-portable pthread_cancel() if there's really no other choice.
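A sketch of the process route (POSIX, error checks omitted, func1() again a busy-loop stand-in): run func1() in a fork()ed child with a pending alarm(), let the kernel's default SIGALRM action kill the child if it overruns, and have the parent inspect the exit status.

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void func1(void)           /* stand-in for the real third-party call */
{
    volatile unsigned long spin = 0;
    for (;;)
        ++spin;
}

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
    {
        alarm(10);                /* default SIGALRM action kills the child */
        func1();
        _exit(0);
    }
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGALRM)
        puts("function executed too long, aborting..");
    return 0;
}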

How to insulate a job/thread from crashes

I'm working on a library where I'm farming various tasks out to some third-party libraries that do some relatively sketchy or dangerous platform-specific work. (In specific, I'm writing a mathematical function parser that calls JIT-compilers, like LLVM or libjit, to build machine code.) In practice, these third-party libraries have a tendency to be crashy (part of this is my fault, of course, but I still want some insurance).
I'd like, then, to be able to very gracefully deal with a job dying horribly -- SIGSEGV, SIGILL, etc. -- without bringing down the rest of my code (or the code of the users calling my library functions). To be clear, I don't care if that particular job can continue (I'm not going to try to repair a crash condition), nor do I really care about the state of the objects after such a crash (I'll discard them immediately if there's a crash). I just want to be able to detect that a crash has occurred, stop the crash from taking out the entire process, stop calling whatever's crashing, and resume execution.
(For a little more context, the code at the moment is a for loop, testing each of the available JIT-compilers. Some of these compilers might crash. If they do, I just want to execute continue; and get on with testing another compiler.)
Currently, I've got a signal()-based implementation that fails pretty horribly; of course, it's undefined behavior to longjmp() out of a signal handler, and signal handlers are pretty much expected to end with exit() or terminate(). Just throwing the code in another thread doesn't help by itself, at least the way I've tested it so far. I also can't hack out a way to make this work using C++ exceptions.
So, what's the best way to insulate a particular set of instructions / thread / job from crashes?
Spawn a new process.
What output do you collect when a job succeeds?
I ask because if the output is low bandwidth I would be tempted to run each job in its own process.
Each of these crashy jobs you fire up has a high chance of corrupting memory used elsewhere in your process.
Processes offer the best protection.
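In the asker's setting that might look like the following sketch, where run_jit_test() is a made-up stand-in for testing one JIT backend (here it deliberately "crashes" for backend 1); anything richer than an exit code can come back over a pipe or a shared memory segment.

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_BACKENDS 3

static int run_jit_test(int backend)  /* stand-in: backend 1 "crashes" */
{
    if (backend == 1)
        raise(SIGSEGV);
    return 0;
}

int main(void)
{
    for (int i = 0; i < NUM_BACKENDS; ++i)
    {
        pid_t pid = fork();
        if (pid == 0)
            _exit(run_jit_test(i));   /* SIGSEGV/SIGILL kill only the child */
        int status;
        waitpid(pid, &status, 0);
        if (WIFSIGNALED(status))
        {
            fprintf(stderr, "backend %d crashed (signal %d), skipping\n",
                    i, WTERMSIG(status));
            continue;                 /* the asker's `continue;` */
        }
        printf("backend %d exited with %d\n", i, WEXITSTATUS(status));
    }
    return 0;
}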
Processes offer the best protection, but it's possible you can't do that.
If your threads' entry points are functions you wrote (for example, ThreadProc in the Windows world), then you can wrap them in try { ... } catch (...) blocks. If you want to communicate that an exception has occurred, you can pass specific error codes back to the main thread or use some other mechanism. If you want to log not only that an exception occurred but also what it was, you'll need to catch specific exception types and extract diagnostic information from them to pass back to the main thread. À la:
#include <exception>
#include <string>

// Hypothetical: however you report errors back to the main thread
void tell_main_thread_what_went_wrong(const std::string& reason);

int my_temperamental_thread()
{
    try
    {
        // ... magic happens ...
        return 0;
    }
    catch (const std::exception& ex)
    {
        // ... or maybe it doesn't ...
        std::string reason = ex.what();
        tell_main_thread_what_went_wrong(reason);
        return 1;
    }
    catch (...)
    {
        // ... definitely not magical happenings here ...
        tell_main_thread_what_went_wrong("uh, something bad and undefined");
        return 2;
    }
}
Be aware that if you go this way, you run the risk of hosing the host process when the exceptions do occur. You say you're not trying to correct the problem, but how do you know the malignant thread didn't eat your stack, for example? Catch-and-ignore is a great way to create horribly confounding bugs.
On Windows, you might be able to use VirtualProtect(YourMemory, PAGE_READONLY) when calling the untrusted code. Any attempt to modify this memory would cause a structured exception, which you can safely catch and then continue execution. However, memory allocated by that library will of course leak, as will other resources. The Linux equivalent is mprotect(YourMemory, PROT_READ), which raises SIGSEGV on write.
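A sketch of the mprotect() half of that idea (untrusted_library_call() is a hypothetical stub standing in for the real library; POSIX only guarantees mprotect() on mmap()ed, page-aligned memory, hence the mmap()):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void untrusted_library_call(const char *data)  /* hypothetical stub */
{
    printf("library read: %s\n", data);
}

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *buf = (char*)mmap(NULL, page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    strcpy(buf, "input the library may only read");

    mprotect(buf, page, PROT_READ);               /* writes now raise SIGSEGV */
    untrusted_library_call(buf);
    mprotect(buf, page, PROT_READ | PROT_WRITE);  /* restore access */
    munmap(buf, page);
    return 0;
}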

I get an exception if I leave the program running for a while

Platform : Win32
Language : C++
I get an error if I leave the program running for a while (~10 min).
Unhandled exception at 0x10003fe2 in ImportTest.exe: 0xC0000005: Access violation reading location 0x003b1000.
I think it could be a memory leak but I don't know how to find that out.
I'm also unable to free() memory, because doing so always causes the following (maybe I shouldn't be calling free() on these variables):
Unhandled exception at 0x76e81f70 in ImportTest.exe: 0xC0000005: Access violation reading location 0x0fffffff.
At that stage the program isn't doing anything; it is just waiting for user input:
dllHandle = LoadLibrary(L"miniFMOD.dll");
playSongPtr = (playSongT)GetProcAddress(dllHandle,"SongPlay");
loadSongPtr = (loadSongT)GetProcAddress(dllHandle,"SongLoadFromFile");
int songHandle = loadSongPtr("FILE_PATH");
// ... (just output; couldn't cause errors)
playSongPtr(songHandle);
getch(); // this is where the error occurs if I leave it running for a while
Edit 2:
playSongPtr() causes the problem, but I don't know how to fix it.
I think it's pretty clear that your program has a bug. If you don't know where to start looking, a useful technique is "divide and conquer".
Start with your program in a state where you can cause the exception to happen. Eliminate half your code, and try again. If the exception still happens, then you've got half as much code to look through. If the exception doesn't happen, then it might have been related to the code you just removed.
Repeat the above until you isolate the problem.
Update: You say "at that stage the program isn't doing anything" but clearly it is doing something (otherwise it wouldn't crash). Is your program a console mode program? If so, what function are you using to wait for user input? If not, then is it a GUI mode program? Have you opened a dialog box and are waiting for something to happen? Have you got any Windows timers running? Any threads?
Update 2: In light of the small snippet of code you posted, I'm pretty sure that if you try to remove the call to the playSongPtr(songHandle) function, then your problem is likely to go away. You will have to investigate what the requirements are for "miniFMOD.dll". For example, that DLL might assume that it's running in a GUI environment instead of a console program, and may do things that don't necessarily work in console mode. Also, in order to do anything in the background (including playing a song), that DLL probably needs to create a thread to periodically load the next bit of the song and queue it in the play buffer. You can check the number of threads being created by your program in Task Manager (or better, Process Explorer). If it's more than one, then there are other things going on that you aren't directly controlling.
The error tells you that your program accessed memory it has not allocated at that moment. It could be a pointer error like dereferencing NULL. Another possibility is that you used memory after you freed it.
The first step would be to check your code for NULL checks, i.e. make sure you have a valid pointer before you use it, and to check the lifecycle of all allocated and freed resources. Writing NULL over pointers you have just freed might help find the problem spot.
I doubt this particular problem is a memory leak; the problem is dereferencing a pointer that does not point to something useful. To check for a memory leak, watch your process in your operating system's process list tool (task manager, ps, whatever) and see if the "used memory" value keeps growing.
On calling free: You should call free() once and only once on the non-null values returned from malloc(), calloc() or strdup(). Calling free() less than once will lead to a memory leak. Calling free() more than once will lead to memory corruption.
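A small illustration of that rule; setting the pointer to NULL after free() turns an accidental second free() into a no-op, because free(NULL) is defined to do nothing:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = (char*)malloc(64);
    if (buf == NULL)
        return 1;       /* allocation failed: nothing to free */
    strcpy(buf, "hello");
    free(buf);          /* exactly once */
    buf = NULL;         /* a stray second free(buf) is now harmless... */
    free(buf);          /* ...because free(NULL) is a no-op */
    return 0;
}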
You should get a stack trace to see what is going on when the process crashes. Based on my reading of the addresses involved you probably have a stack overflow or have an incorrect pointer calculation using a stack address (in C/C++ terms: an "auto" variable.) A stack trace will tell you how you got to the point where it crashed.