I'm wondering if there is a way to catch every segmentation fault/core dump and print its call stack. Catching all signals looks like a doable approach, but I'm not sure how it works; in my experience it did not always give the outcome I wanted, and sometimes it simply failed to catch the core dump, so maybe I did something wrong.
The reason I ask is that I usually debug a very complicated system, and many segfault errors are hard to reproduce, and it's not practical to run it with gdb line by line. So if I could catch every segfault and print a call stack or other information, that would be a great help when debugging.
I'm wondering if there is a way to catch every segmentation fault ...
Certainly. This can be achieved by registering a signal handler using std::signal.
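A minimal sketch of that registration, assuming nothing beyond standard Linux (the handler name and message are mine; only async-signal-safe functions such as write() may be called inside the handler):

#include <csignal>
#include <cstdlib>
#include <unistd.h>

extern "C" void on_sigsegv(int) {
    // Only async-signal-safe calls are allowed here; write() is one of them.
    const char msg[] = "caught SIGSEGV\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(EXIT_FAILURE);  // never return: the faulting instruction would re-run
}

int main() {
    std::signal(SIGSEGV, on_sigsegv);
    // ... rest of the program ...
}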
... and print its call stack?
This is quite a lot trickier. There is no standard way to inspect the call stack in C++. Linux has several ways, but unfortunately the most conventional one, the glibc backtrace function, is not async-signal-safe, so there is no guarantee that it will work in a signal handler.
A simpler approach is to not do this in a signal handler, but instead configure Linux to generate a core dump. You'll get much more information out of that.
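Core dumps are often disabled by a soft resource limit of zero. As a sketch, a process can raise its own limit with setrlimit(); whether a core file actually appears also depends on the system's core_pattern settings:

#include <sys/resource.h>

// Raise the soft core-file size limit to the hard limit so the kernel
// can write a core dump when the process dies from a fatal signal.
void enable_core_dumps() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        rl.rlim_cur = rl.rlim_max;
        setrlimit(RLIMIT_CORE, &rl);
    }
}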
... core dump, and print its call stack?
Certainly. You can use a debugger:
gdb /path/to/executable /path/to/corefile
(gdb) bt
and not practical to run it with gdb line by line.
Then run it in gdb but not line by line. Simply run the program within gdb until it segfaults, then print the backtrace.
Related
My Qt app uses the cURL library to send HTTP requests, and sometimes cURL raises SIGSEGV, after which my app crashes.
Is it possible to catch this signal and prevent the segmentation fault?
TL;DR:
Don't attempt to block or ignore SIGSEGV. It will just bring you pain.
Explanation:
SIGSEGV means very bad things have happened, and your program has wandered into memory that it is not allowed to access. This means your program is utterly hosed.
Pardon my Canadian.
Your program is broken. You can't trust it to behave in any rational fashion anymore and should actually thank that SIGSEGV for letting you know this. As much as you don't want to see SIGSEGV, it's a hell of a lot better than the alternative of a broken program continuing to run and spewing out false information.
You can catch a SIGSEGV with a signal handler, but say you do catch it and try to handle it. Odds are the program goes right back into the code that triggered the SIGSEGV and very likely raises it again as soon as the signal handler returns. Or does something else weird. Possibly something worse.
It's not really even worth catching SIGSEGV with a signal handler to make an attempt to output a message, because what's the message going to say? "Oh Smurf. Something very bad happened."?
So, yes, you can catch it, but you can't do anything to recover. The program is already too damaged to continue. That leaves preventing SIGSEGV in the first place, and that means fixing your code.
The best thing you can do is determine why the program crashed. The easiest way to do that is run your program with whatever debugger came with your development environment, wait for the crash, then inspect the backtrace for hints on how the program came to meet its doom.
Typically a link to Eric Lippert's How to debug small programs can be found right about here, and I can't think of a good reason to leave it out.
One other thing, though. cURL is a pretty solid library. Odds are overwhelmingly good that you are using it wrong or passed it bad information: a dead pointer, an unterminated C-style string, a pointer to an explosive function. I'd start by looking at how you are using cURL.
No, it's not. Instead, fix the bug and prevent the segmentation fault that way. Presumably your platform supplies some kind of debugger.
When a C++ program being debugged emits a SIGSEGV, it is possible to handle the signal in gdb and tell gdb not to stop (nostop).
How does gdb handle this kind of scenario?
I have searched the gdb source code and couldn't find a starting point.
You cannot automatically ignore SIGSEGV, and I wouldn't recommend doing that anyway. Although you can make gdb ignore the signal and not pass it to the program, the kernel will attempt to re-run the offending instruction once the signal handler returns, resulting in an infinite loop. See this answer for more information.
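A tiny demonstration of that re-run loop, using nothing but a null-pointer write (the printf is only there to make the loop visible; it is not async-signal-safe):

#include <csignal>
#include <cstdio>

extern "C" void handler(int) {
    printf("SIGSEGV again\n");
    // Returning from here re-executes the faulting store, which faults again.
}

int main() {
    std::signal(SIGSEGV, handler);
    *(volatile int *)0 = 1;  // faults; the handler then runs in an endless loop
}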
One way to work around it is to skip the instruction or change register values so that it does not segfault. The link shows an example of setting a register. You can also use the jump command to skip over an instruction.
it is possible to handle the signal and tell gdb not to stop (nostop).
It's unclear whether you want GDB to handle the signal, or the program itself.
If the latter, the gdb command handle SIGSEGV nostop noprint pass will do exactly that.
This is actually something the OS does. On Windows, if a program has a debugger attached and an exception is thrown, Windows will ask the debugger if it wants to handle it. If/when it declines, it passes it to the program. If the program doesn't handle it, Windows passes it to the debugger again.
As the question states, is it useful to always collect a software-based backtrace (like using libc backtrace, http://www.gnu.org/software/libc/manual/html_node/Backtraces.html) in all error functions and signal handlers?
Would it not be very helpful for debugging a wide variety of bugs, like memory and concurrency bugs? I guess it would not hurt normal performance either, as it would be triggered only in error paths.
As the question states, is it useful to always collect a software-based backtrace
Yes, it is generally very useful to have a crash stack trace when:
- your code runs in your own environment, and you are not worried about the stack trace revealing any secrets, and
- the crash handler does not further corrupt the coredump, does not hang, etc.
like using libc backtrace
glibc backtrace calls calloc under certain conditions, and is not safe in a crash handler. It can cause both the hang and the further corruption mentioned above. Writing a crash handler that will reliably print a stack trace in an async-signal-safe manner is quite non-trivial.
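A common partial mitigation, sketched here, is to call backtrace() once at startup so that any lazy loading and allocation happens outside the handler, and to use backtrace_symbols_fd(), which writes to a file descriptor instead of allocating. Even then this is not guaranteed async-signal-safe:

#include <execinfo.h>
#include <csignal>
#include <unistd.h>

extern "C" void crash_handler(int sig) {
    void *frames[64];
    int n = backtrace(frames, 64);
    // backtrace_symbols_fd() writes straight to the fd; no malloc involved.
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(128 + sig);
}

int main() {
    void *warmup[1];
    backtrace(warmup, 1);   // force the unwinder library to load now
    std::signal(SIGSEGV, crash_handler);
    // ...
}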
why then do error functions in "standard" applications not call backtrace?
Consider cat /no/such/file. Currently it produces:
cat: /no/such/file: No such file or directory
which is all you really need to know. Making this print anything else is useless. If you had many such files, and cat printed a full stack trace for each, you'd get many pages of error output, and that would only make finding the real problem harder.
For fatal signal handlers (e.g. SIGSEGV) the answer is that most "standard" applications don't actually handle such signals, and simply use the default action, which produces a core dump.
But if they did catch the signal, calling backtrace, backtrace_symbols, or backtrace_symbols_fd from the signal handler would be equally unsafe, and could deadlock, which is much worse than simply dumping core. Consider what happens if you have a long-running script with 1000 commands in it. You start it, and a week later discover that it has made no progress because the second command crashed and deadlocked trying to print the crash stack trace.
I wrote a Linux program based on a buggy open source library. This library sometimes triggers segfaults that I cannot control, and of course once the library segfaults, the entire program dies. However, I have to make sure my program keeps running even if the library segfaults. This is because my program sort of serves as a "server" and it needs to at least tell the clients something bad happened and recover from the error, rather than chicken out... Is there any way to do that?
I understand in Java one just needs to catch an exception. But how does C++ handle this?
[UPDATE] I understand there is also exception handling in C++, but a segfault is not an exception, is it? I don't think anything is thrown when a segfault happens. You have to explicitly "throw" something to use try ... catch ..., as far as I know.
Thanks so much, I am quite new to C++.
You cannot reliably resume execution after a segmentation violation. If your program must remain running, fence off the offending library in a separate process and communicate with it over a pipe. When it takes a segmentation violation, your program will notice the closed pipe.
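A rough sketch of that fencing, with a hypothetical run_buggy_library() standing in for the untrusted call; the parent notices the crash either through the pipe closing or through waitpid() reporting SIGSEGV:

#include <csignal>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    int fds[2];
    if (pipe(fds) != 0) return 1;

    pid_t pid = fork();
    if (pid == 0) {                  // child: do the risky work
        close(fds[0]);
        // run_buggy_library();      // hypothetical; may SIGSEGV, killing only the child
        const char ok[] = "ok";
        write(fds[1], ok, sizeof ok);
        _exit(0);
    }

    close(fds[1]);                   // parent: wait for a result or EOF
    char buf[16];
    ssize_t n = read(fds[0], buf, sizeof buf);
    int status;
    waitpid(pid, &status, 0);
    if (n <= 0 && WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
        fprintf(stderr, "library worker crashed; tell the clients and restart it\n");
    return 0;
}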
Unfortunately, you cannot make the program continue. The buggy code that resulted in SIGSEGV usually triggers undefined behaviour like dereferencing a null pointer or reading garbage memory. You cannot possibly continue if your code operates on invalid data.
You can handle the signal, but the most you can do is dump the stack trace and die.
C and C++ are inherently unsafe; you cannot handle errors triggered by undefined behaviour and let the program continue.
You can use signal handlers. It's not really recommended, though, because you can't guarantee that you've eliminated the cause of the problem. The best thing to do would be to isolate it in a separate process; this is the approach Google Chrome takes.
If it's FOSS, the easiest thing to do would be to just debug it.
If you have access to the source, it might be useful to run the program in a debugger like GDB. GDB stops at the line which causes the segfault.
If you really want to catch the signal though, you need to hook up a signal handler, using the signal system call. I would probably just stick to the debugger though.
EDIT:
Since you write that the library segfaults, I would just like to point out the first rule of programming: It's always your fault. Especially if you are new to C++, the segfault probably happens because you have used the library incorrectly in some way. C++ is a very subtle language and it is easy to do things you don't intend.
As mentioned over here, you can't catch segfault signals with try blocks or "map" segment violations to anything. It's a really bad idea to handle SIGSEGV yourself. A SEGV from C++ code is a severe error. You can use gdb to figure out why it happens and then fix it.
I want to be able to detect when a write to memory address occurs -- for example by setting a callback attached to an interrupt. Does anyone know how?
I'd like to be able to do this at runtime (possibly gdb has this feature, but my particular application causes gdb to crash).
If you want to intercept writes to a range of addresses, you can use mprotect() to mark the memory in question as non-writeable, and install a signal handler using sigaction() to catch the resulting SIGSEGV, do your logging or whatever and mark the page as writeable again.
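A condensed sketch of that pattern (names are mine, and the fprintf in the handler is not async-signal-safe; it is only there to keep the example short). Note this only catches the first write: to keep catching later ones you would have to re-protect the page after each access, typically by single-stepping past the faulting instruction.

#include <signal.h>
#include <cstdio>
#include <sys/mman.h>
#include <unistd.h>

static char *watched;
static long page_size;

extern "C" void segv_handler(int, siginfo_t *si, void *) {
    fprintf(stderr, "write to %p intercepted\n", si->si_addr);
    // Make the page writeable again so the retried instruction succeeds.
    mprotect(watched, page_size, PROT_READ | PROT_WRITE);
}

int main() {
    page_size = sysconf(_SC_PAGESIZE);
    watched = (char *)mmap(nullptr, page_size, PROT_READ,   // read-only at first
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);

    watched[0] = 42;       // traps, gets logged, then completes on retry
    printf("value after trap: %d\n", watched[0]);
}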
What you need is access to the X86 debug registers: http://en.wikipedia.org/wiki/Debug_register
You'll need to set the breakpoint address in one of DR0 to DR3, and then the condition (data write) in DR7. The interrupt will occur and you can run your debug code to read DR6 and find what caused the breakpoint.
If GDB doesn't work, you might try a simpler/smaller debugger such as http://sourceforge.net/projects/minibug/ - if that isn't working, you can at least go through the code and understand how to use the debugging hardware on the processor yourself.
Also, there's a great IBM developer resource on mastering linux debugging techniques which should provide some additional options:
http://www.ibm.com/developerworks/linux/library/l-debug/
A reasonably good article on doing this on Windows is here (I know you're running on Linux, but others might come along to this question wanting to do it on Windows):
http://www.codeproject.com/KB/debug/hardwarebreakpoint.aspx
-Adam
GDB does have that feature: it is called hardware watchpoints, and it is very well supported on Linux/x86:
(gdb) watch *(int *)0x12345678
If your application crashes GDB, build current GDB from CVS Head.
If that GDB still fails, file a GDB bug.
Chances are we can fix GDB faster than you can hack around SIGSEGV handler (provided a good test case), and fixes to GDB help you with future problems as well.
mprotect does have a disadvantage: your memory must be page-boundary aligned. I had my problematic memory on the stack and was not able to use mprotect().
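For heap or mmap'd memory you can at least round the target address down to its page base before calling mprotect(); a small sketch (the whole page changes protection, which is exactly the drawback described above):

#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

// Change protection on the page containing addr; mprotect() itself
// requires a page-aligned starting address.
int protect_containing_page(void *addr, int prot) {
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t base = (uintptr_t)addr & ~(uintptr_t)(page - 1);
    return mprotect((void *)base, page, prot);
}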
As Adam said, what you want is to manipulate the debug registers. On Windows, I used this: http://www.morearty.com/code/breakpoint/ and it worked great. I also ported it to Mach-O (Mac OS X), and it worked great, too. It was also easy, because Mach-O has thread_set_state(), which is equivalent to SetThreadContext().
The problem with Linux is that it doesn't have such equivalents. I found ptrace, but I thought, this can't be it, there must be something simpler. But there isn't. Yet. I think they are working on a hw_breakpoint API for both kernel and user space (see http://lwn.net/Articles/317153/).
But when I found this: http://blogs.oracle.com/nike/entry/memory_debugger_for_linux I gave it a try and it wasn't that bad. The ptrace method works by some "outside process" acting as a "debugger", attaching to your program, injecting new values for the debug registers, and terminating, with your program continuing with a new hardware breakpoint set. The thing is, you can create this "outside process" yourself by using fork() (I had no success with a pthread) and doing these simple steps inline in your code.
The addwatchpoint code must be adapted to work with 64-bit Linux, but that's just changing USER_DR7 etc. to offsetof(struct user, u_debugreg[7]). Another thing is that after a PTRACE_ATTACH, you have to wait for the debuggee to actually stop. But instead of retrying a POKEUSER in a busy loop, the correct thing to do is a waitpid() on your pid.
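A condensed sketch of that attach/poke/detach sequence on 64-bit x86 Linux. The DR7 value follows the x86 manual (enable slot 0 locally, condition 01 = data write, length 11 = 4 bytes), the parent is assumed to have a SIGTRAP handler installed, and on systems with yama ptrace_scope enabled the attach may fail:

#include <stddef.h>       // offsetof
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork a one-shot "debugger" child that sets a 4-byte write watchpoint
// on addr in the parent, then detaches so the parent keeps running.
static void set_hw_watchpoint(void *addr) {
    pid_t parent = getpid();
    if (fork() != 0) {
        int status;
        wait(&status);                       // parent: wait for the child to finish
        return;
    }
    ptrace(PTRACE_ATTACH, parent, 0, 0);
    int status;
    waitpid(parent, &status, 0);             // wait for the tracee to actually stop
    ptrace(PTRACE_POKEUSER, parent,
           offsetof(struct user, u_debugreg[0]), addr);
    ptrace(PTRACE_POKEUSER, parent,
           offsetof(struct user, u_debugreg[7]),
           0x1UL | (0x1UL << 16) | (0x3UL << 18));
    ptrace(PTRACE_DETACH, parent, 0, 0);     // parent resumes, watchpoint armed
    _exit(0);
}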
The only catch with the ptrace method is that your program can have only one "debugger" attached at a time. So a ptrace attach will fail if your program is already running under gdb control. But just like the example code does, you can register a signal handler for SIGTRAP, run without gdb, and when you catch the signal, enter a busy loop waiting for gdb to attach. From there you can see who tried to write your memory.