Recover from crash with a core dump - C++

A C++ program crashed on FreeBSD 6.2 and the OS was kind enough to create a core dump. Is it possible to amputate some stack frames, reset the instruction pointer and restart the process in gdb, and how?

Is it possible to amputate some stack frames, reset the instruction pointer and restart the process in gdb?
I assume you mean: change the process state, and set it to start executing again (as if it never crashed in the first place).
No. For one thing, how do you propose GDB (if it magically had this capability) would handle your file descriptors (which the kernel automatically closed when your process died)?

Yes, gdb can debug core dumps just as well as running programs. Assuming that a.out is the name of your program's executable and that a.core is the name of your core file, invoke gdb like so:
gdb a.out a.core
And then you can debug like normal, except you cannot continue execution in any way (even if you could, the program would just crash again). You can examine the stack trace, registers, memory, etc.
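For example, typical post-mortem commands look like this (the variable name is just a placeholder):
(gdb) bt                    # stack trace at the point of the crash
(gdb) info registers        # register contents
(gdb) frame 2               # select a particular stack frame
(gdb) print some_var        # inspect a variable visible in that frame
(gdb) x/16xb &some_var      # examine raw memory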

Possible duplicate of this: Best practices for recovering from a segmentation fault
Summary: it is possible but not recommended. The way to do it is to use setjmp() and longjmp() from a signal handler. (Please see the complete source code example in the duplicate post.)
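A minimal sketch of the idea (an illustration only, not the code from the linked post), using sigsetjmp()/siglongjmp() so the signal mask is restored on the jump:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf recovery_point;

extern "C" void segv_handler(int) {
    // Jump back to the recovery point. Note that "recovering" from SIGSEGV
    // like this leaves the program in an essentially undefined state.
    siglongjmp(recovery_point, 1);
}

int main() {
    signal(SIGSEGV, segv_handler);
    if (sigsetjmp(recovery_point, 1) == 0) {
        int *p = nullptr;
        *p = 42;                         // deliberate crash
    } else {
        puts("recovered from SIGSEGV (not recommended for production code)");
    }
    return 0;
}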

Related

How is gdb stack trace readability of release code influenced on x64?

I am working on a project, where the request "we want more information in release build stack traces" came up.
With "stack trace" I mean basically the output of t a a bt in gdb, which I suppose to be equivalent to the output of gstack for a running process. If this is true would be one of my questions.
My main problem is that availability of stack traces is rather erratic (sometimes you have them, sometimes you don't) and documentation could be more detailed (e.g. gdb documentation states that "-fomit-frame-pointer makes debugging impossible on some machines.", without any clear information about x86_64)
Also, when examining a running program with gstack, I get a quite perfect stack traces. I am unsure, though, if this is exactly what I would get from a core dump with gdb (which would mean that all cases where I get less information, the stack has been really corrupted).
Currently, the code is compiled with -O2. I have seen one stack trace lately, where our own program code's stack frames did not have any function parameter values, but the first (inner) frames, where our code already called a third party library, provided these values. Here, I am not sure if this is a sign that the first party library had better gcc debugging options set, or if these information is just lost at some point iterating down the stack trace.
I guess my questions are:
Which compiler options influence the stack trace quality on x86_64?
Are stack traces from these origins identical:
output of gstack of a running program
attached gdb to a running program, executed t a a bt
called gcore on a running program, opening core with gdb, then t a a bt
program aborted and core file written by system, opened with gdb
Is there some in-depth documentation on which parameters affect stack trace quality on x86_64?
All considerations made under the assumption that the program binary exists for the core dump, and source code is not available.
With "quality of a stack trace" i mean 3 criteria:
called function names are available, not just "??"
The source codes file name and line number is available
function call parameters are available.
Which compiler options influence the stack trace quality on x86_64
-fomit-frame-pointer is the default on x86_64, and does not make stack traces unusable.
GDB relies on unwind descriptors, and you could strip these with either strip or -fno-unwind-tables (this is ill-advised).
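For illustration (file names are made up), a build line along these lines keeps what the three criteria above need, and the commented variants are the ones that degrade it:

g++ -O2 -g -o myprog myprog.cpp     # -g keeps function names, file/line and parameter information
# strip myprog                      # removes symbols: frames show up as "??"
# g++ -O2 -fno-unwind-tables ...    # drops the unwind descriptors GDB relies on (as mentioned above)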
are stack traces from these origins identical:
output of gstack of a running program
Last I looked, gstack was a trivial shell script that invoked gdb, so yes.
attached gdb to a running program, executed "t a a bt"
Yes.
called gcore on a running program, opening core with gdb, then "t a a bt"
Yes, provided the core is opened with GDB on the same system where gcore was run.
program aborted and core file written by system, opened with gdb
Same as above.
If you are trying to open core on a different system from the one where it was produced, and the binary uses dynamic libraries, you need to set sysroot appropriately. See this question and answer.
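Roughly like this (the paths are placeholders; the sysroot directory should contain copies of the other machine's libraries):

gdb
(gdb) set sysroot /path/to/target-root-copy
(gdb) file ./program
(gdb) core-file ./core.1234
(gdb) thread apply all bt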
Note that there are a few reasons stack may look corrupt or unavailable in GDB:
-fno-unwind-tables or stripping, as mentioned above
code compiled from assembly that lacks proper .cfi directives
third party libraries that were built with a very old compiler and have incorrect unwind descriptors (anything before gcc-4.4 was pretty bad).
and finally, stack corruption.

Locating segmentation fault for multithread program running on cluster

It's quite straightforward to use gdb to locate a segmentation fault while running a simple program in interactive mode. But consider that we have a multithreaded program (written with pthreads) submitted to a cluster node (by the qsub command), so we don't have interactive access to it.
How can we locate the segmentation fault? I am looking for a general approach, a program or test tool. I cannot provide a reproducible example because the program is really big and crashes on the cluster in some unknown situations.
I need to find the problem in this awkward situation because the program runs correctly on the local machine with any number of threads.
The "normal" approach is to have the environment produce a core file and get hold of those. If this isn't an option, you might want to try installing a signal handler for SIGSEGV which obtains, at least, a stack trace dumped somewhere. Of course, this immediately leads to the question "how to get a stack trace" but this is answered elsewhere.
The easiest approach is probably to get hold of a core file. Assuming you have a similar machine where the core file can be read, you can use gdb program corefile to debug the program program which produced the core file corefile: you should be able to look at the different threads, their data (to some extent), etc. If you don't have a suitable machine, it may be necessary to cross-compile gdb to match the hardware of the machine where the program was run.
I'm a bit confused about the statement that the core files are empty: You can set the limits for core files using ulimit on the shell. If the size for cores is set to zero it shouldn't produce any core file. Producing an empty one seems odd. However, if you cannot change the limits on your program you are probably down to installing a signal handler and dumping out a stack trace from the offending thread.
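For reference, the core size limit mentioned above can be raised in the shell (or job script) that launches the program:

ulimit -c unlimited     # allow core files of any size for this shell and its children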
Thinking of it, you may be able to put the program to sleep in the signal handler and attach to it using a debugger, assuming you can run a debugger on the corresponding machine. You would determine the process ID (using, e.g., ps -elf | grep program) and then attach to it using
gdb program pid
I'm not sure how to put a program to sleep from within the program itself, though (possibly by raising SIGSTOP from within the SIGSEGV handler...).
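One way to do that, sketched here as an assumption rather than something tested: raise SIGSTOP from the SIGSEGV handler, so the process suspends itself until a debugger attaches. It would be installed with sigaction(SIGSEGV, ...) just like the handler sketched earlier:

#include <signal.h>

extern "C" void segv_stop_handler(int) {
    // Suspend the process; someone can then attach with: gdb program <pid>
    // raise() is async-signal-safe.
    raise(SIGSTOP);
}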
That said, I assume you tried running your program on your local machine...? Some problems are more fundamental than needing a distributed system of many threads running on each node. This is, obviously, not a replacement for the approach above.

Printing certain variables in gdb causes execution to resume

I have been using gdb for most of a decade now and have never seen this particular problem. I upgraded to gdb 7.4 and the problem persists.
I am debugging a Cilk-multithreaded C++ application on RHEL5. Execution ceases at a seg fault. When I ask gdb to print the value of certain variables (which are boost::intrusive_ptr references to templated object instances) gdb will print the proper value, but will also resume execution on all threads for a very short time. I suspect it resumes execution because more of my debug print statements scroll to the terminal (it's not just clearing a buffer---I can keep printing it and it keeps resuming execution). This short spurt of continued execution causes the values of the variables I am tracking to change. This hinders debugging, to say the least.
I suspect that I have a memory leak and the stack is getting corrupted, but I've run valgrind on the code (with different initial conditions) and it shows no memory leaks in the major subsystem that I am debugging, except for a nominal Cilk-internal leak.
When I ask gdb to print the value of certain variables (which are boost::intrusive_ptr references to templated object instances) gdb will print the proper value, but will also resume execution on all threads for a very short time.
The only way that I know of for this to happen is if you have
A python pretty-printer for the type (boost::intrusive_ptr) and
That pretty-printer calls back into the inferior (being debugged) process.
You can disable all pretty-printers by e.g. disable pretty-printer. If that helps, you should probably figure out which exact pretty-printer is doing this, and contact its author.
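For example (the variable name is hypothetical):

(gdb) info pretty-printer           # list the printers that are loaded
(gdb) disable pretty-printer        # disable all of them
(gdb) print my_intrusive_ptr        # should now print the raw object without calling into the inferior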

Corrupt stack problem in C/C++ program

I am running a C/C++ program on Linux servers to serve videos. The program's (say, named Plugin) core functionality is to convert videos, and we fork a separate Plugin process for each video request. But I am having a weird problem where the server load average sometimes gets unexpectedly high. What I see from the top command at that point is that some processes have been running for a long time and are consuming a huge amount of CPU.
When I debug this running program with gdb and backtrace the stack, what I find is a corrupt stack: "Previous frame inner to this frame (corrupt stack?)". I have searched the net and found that this occurs if the program gets a segmentation fault.
But as far as I know, if the program gets a segmentation fault, it should crash and exit at that point. Surprisingly, though, the program keeps running after the segmentation fault.
What can be the causes of this? I know there must be some big problems in the program, but I just can't figure out where to start fixing them... It would be great if any of you could shed some light on this...
Thanks in advance
Attaching the debugger changes the behavior of the process, so most probably you won't get reliable investigation results. A "corrupted stack" message from the debugger can also mean that the particular debugger does not understand the debug info in the binary.
I would recommend running pstack on the problematic process several times in succession (this is known as "Monte Carlo performance profiling") and also attaching strace or truss to it to check what system calls the process is making while it consumes CPU.
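For example (the PID is a placeholder):

pstack 12345          # repeat several times; frames that keep showing up are where the time goes
strace -c -p 12345    # attach and count system calls; press Ctrl-C to get the summary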
Run your program under Valgrind and fix any invalid memory writes that it finds.
Certain optimisations, such as frame pointer omission, can make it harder for the debugger to understand the stack.
If you have the code, compile the program in debug and run Valgrind on it.
If you don't have the code, contact the author/provider of the program.
The corrupt stack message simply means the code is doing something weird with the memory. It does not mean the program has a segmentation fault. Also, the program can still run if it chooses to handle the SIGSEGV signal.
If by forking you mean that you have some process which spawns and runs other smaller processes, just monitor for such spikes and restart the process. This assumes that you have no access to fix the program.
There could be some interesting manipulation of the stack taking place through assembly-level tricks, such as true tail-recursion optimization, self-modifying code, or non-returning functions, which may leave the debugger unable to back-trace the stack properly and cause it to report a corrupted stack. That doesn't necessarily mean that memory is corrupted... but it definitely means something non-traditional is happening under the hood.

What causes a Sigtrap in a Debug Session

In my C++ program I'm using a library which will "send?" a SIGTRAP on certain operations when
I'm debugging it (using gdb as the debugger). I can then choose whether I wish to Continue or Stop the program. If I choose to continue, the program works as expected, but setting custom breakpoints after a SIGTRAP has been caught causes the debugger/program to crash.
So here are my questions:
What causes such a SIGTRAP? Is it a leftover line of code that can be removed, or is it caused by the debugger when it "finds something it doesn't like"?
Is a SIGTRAP, generally speaking, a bad thing, and if so, why does the program run flawlessly when I compile a Release version and not a Debug version?
What does a SIGTRAP indicate?
This is a more general version of a question I posted yesterday: Boost Filesystem: recursive_directory_iterator constructor causes SIGTRAPS and debug problems.
I think my question was far too specific, and I don't want you to solve my problem but to help me (and hopefully others) understand the background.
Thanks a lot.
With processors that support instruction breakpoints or data watchpoints, the debugger will ask the CPU to watch for instruction accesses to a specific address, or data reads/writes to a specific address, and then run full-speed.
When the processor detects the event, it will trap into the kernel, and the kernel will send SIGTRAP to the process being debugged. Normally, SIGTRAP would kill the process, but because it is being debugged, the debugger will be notified of the signal and handle it, mostly by letting you inspect the state of the process before continuing execution.
With processors that don't support breakpoints or watchpoints, the entire debugging environment is probably done through code interpretation and memory emulation, which is immensely slower. (I imagine clever tricks could be done by setting pagetable flags to forbid reading or writing, whichever needs to be trapped, and letting the kernel fix up the pagetables, signaling the debugger, and then restricting the page flags again. This could probably support near-arbitrary number of watchpoints and breakpoints, and run only marginally slower for cases when the watchpoint or breakpoint aren't frequently accessed.)
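For example, the corresponding GDB commands (names are hypothetical) are:

(gdb) break my_function      # ordinary software breakpoint
(gdb) hbreak my_function     # hardware-assisted instruction breakpoint
(gdb) watch my_counter       # data watchpoint: stop when my_counter is written
(gdb) rwatch my_counter      # stop when my_counter is read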
The question I placed into the comment field looks apropos here, only because Windows isn't actually sending a SIGTRAP, but rather signaling a breakpoint in its own native way. I assume when you're debugging programs, that debug versions of system libraries are used, and ensure that memory accesses appear to make sense. You might have a bug in your program that is papered-over at runtime, but may in fact be causing further problems elsewhere.
I haven't done development on Windows, but perhaps you could get further details by looking through your Windows Event Log?
While working in Eclipse with the MinGW/gcc compiler, I realized it reacts very badly to vectors in my code, resulting in an unclear SIGTRAP signal and sometimes even abnormal debugger behavior (i.e. jumping somewhere up in the code and continuing execution in reverse order!).
I copied the files from my project into Visual Studio and resolved the issues there, then copied the changes back to Eclipse and voilà, it worked like a charm. The causes were things like confusing vector initialization via reserve() with resize(), or trying to access elements out of the bounds of the vector.
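As an illustration of the reserve()/resize() confusion (my own example, not the original code):

#include <vector>

int main() {
    std::vector<int> v;
    v.reserve(10);    // capacity becomes 10, but the size is still 0
    // v[5] = 1;      // out of bounds: undefined behavior; checked/debug builds may raise a trap here
    v.resize(10);     // size becomes 10, elements are value-initialized
    v[5] = 1;         // fine
    return 0;
}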
Hope this will help someone else.
I received a SIGTRAP from my debugger and found out that the cause was due to a missing return value.
std::string getName() { printf("Name!"); }   // no return statement: undefined behavior, surfaced as a SIGTRAP under the debugger