Determine the line of code that causes a segmentation fault? - C++

How does one determine where the mistake is in the code that causes a segmentation fault?
Can my compiler (gcc) show the location of the fault in the program?

GCC can't do that, but GDB (a debugger) sure can. Compile your program using the -g switch, like this:
gcc program.c -g
Then use gdb:
$ gdb ./a.out
(gdb) run
<segfault happens here>
(gdb) backtrace
<offending code is shown here>
Here is a nice tutorial to get you started with GDB.
Where the segfault occurs is generally only a clue as to where the mistake that causes it lies in the code. The reported location is not necessarily where the problem resides.

Also, you can give valgrind a try: if you install valgrind and run
valgrind --leak-check=full <program>
then it will run your program and display stack traces for any segfaults, as well as any invalid memory reads or writes and memory leaks. It's really quite useful.

You could also use a core dump and then examine it with gdb. To get useful information you also need to compile with the -g flag.
Whenever you get the message:
Segmentation fault (core dumped)
a core file is written into your current directory. And you can examine it with the command
gdb your_program core_file
The file contains the state of the memory when the program crashed. A core dump can be useful during the deployment of your software.
Make sure your system doesn't set the core dump file size to zero. You can set it to unlimited with:
ulimit -c unlimited
Be careful though: core dumps can become huge.
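Putting it together, a typical session might look like this (file names are illustrative):
$ ulimit -c unlimited
$ gcc -g program.c -o program
$ ./program
Segmentation fault (core dumped)
$ gdb program core
(gdb) backtrace
<stack at the time of the crash is shown here>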

There are a number of tools available which help debugging segmentation faults and I would like to add my favorite tool to the list: Address Sanitizers (often abbreviated ASAN).
Modern¹ compilers come with the handy -fsanitize=address flag, which adds some compile-time and run-time overhead in exchange for more error checking.
According to the documentation these checks include catching segmentation faults by default. The advantage here is that you get a stack trace similar to gdb's output, but without running the program inside a debugger. An example:
int main() {
    volatile int *ptr = (int*)0;
    *ptr = 0;
}
$ gcc -g -fsanitize=address main.c
$ ./a.out
AddressSanitizer:DEADLYSIGNAL
=================================================================
==4848==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x5654348db1a0 bp 0x7ffc05e39240 sp 0x7ffc05e39230 T0)
==4848==The signal is caused by a WRITE memory access.
==4848==Hint: address points to the zero page.
#0 0x5654348db19f in main /tmp/tmp.s3gwjqb8zT/main.c:3
#1 0x7f0e5a052b6a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x26b6a)
#2 0x5654348db099 in _start (/tmp/tmp.s3gwjqb8zT/a.out+0x1099)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/tmp.s3gwjqb8zT/main.c:3 in main
==4848==ABORTING
The output is slightly more complicated than what gdb would output but there are upsides:
There is no need to reproduce the problem to receive a stack trace. Simply enabling the flag during development is enough.
ASan catches a lot more than just segmentation faults. Many out-of-bounds accesses will be caught even if that memory area was accessible to the process.
¹ That is Clang 3.1+ and GCC 4.8+.

All of the above answers are correct and recommended; this answer is intended only as a last-resort if none of the aforementioned approaches can be used.
If all else fails, you can always recompile your program with various temporary debug-print statements (e.g. fprintf(stderr, "CHECKPOINT REACHED # %s:%i\n", __FILE__, __LINE__);) sprinkled throughout what you believe to be the relevant parts of your code. Then run the program and observe which debug-print was the last one printed before the crash occurred: you know your program got that far, so the crash must have happened after that point. Add or remove debug-prints, recompile, and run the test again, until you have narrowed it down to a single line of code. At that point you can fix the bug and remove all of the temporary debug-prints.
It's quite tedious, but it has the advantage of working just about anywhere. The only times it might not are if you don't have access to stdout or stderr for some reason, or if the bug you are trying to fix is a race condition whose behavior changes when the timing of the program changes (since the debug-prints will slow down the program and change its timing).
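For instance, those prints can be wrapped in a small helper macro (a sketch; the CHECKPOINT name is my own):
#include <stdio.h>

/* hypothetical helper: prints the file and line of each checkpoint reached */
#define CHECKPOINT() fprintf(stderr, "CHECKPOINT REACHED # %s:%i\n", __FILE__, __LINE__)

void suspect_code(void) {
    CHECKPOINT();   /* if this is the last line printed, the crash is below it */
    /* ... code under suspicion ... */
    CHECKPOINT();
}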

Lucas's answer about core dumps is good. In my .cshrc I have:
alias core 'ls -lt core; echo where | gdb -core=core -silent; echo "\n"'
to display the backtrace by entering 'core'. The ls -lt also shows the date stamp, so I can be sure I am looking at the right file :(.
Added: If there is a stack corruption bug, then the backtrace applied to the core dump is often garbage. In this case, running the program within gdb can give better results, as per the accepted answer (assuming the fault is easily reproducible). And also beware of multiple processes dumping core simultaneously; some OS's add the PID to the name of the core file.

This is a crude way to find the exact line after which the segmentation fault occurred.
Define a line-logging function:
#include <iostream>
void log(int line) {
    std::cout << line << std::endl;
}
Find and replace every semicolon after the log function definition with "; log(__LINE__);".
Make sure to remove the log calls that this replacement inserts into for (;;) loop headers, since they will not compile there.

If you have a reproducible exception like segmentation fault, you can use a tool like a debugger to reproduce the error.
I used the following approach to find the source code location of even a non-reproducible error. It is based on the Microsoft compiler tool chain, but the underlying idea carries over.
1. Save the MAP file for each binary (DLL, EXE) before you give it to the customer.
2. If an exception occurs, look up the address in the MAP file and determine the function whose start address is just below the exception address. As a result you know the function where the exception occurred.
3. Subtract the function's start address from the exception address. The result is the offset within the function.
4. Recompile the source file containing the function with assembly listing enabled. Extract the function's assembly listing.
5. The assembly includes the offset of each instruction in the function. Look up the source code line that matches the offset within the function.
6. Evaluate the assembler code for that specific source code line. The offset points exactly at the assembler instruction that caused the exception. With a bit of experience with the compiler output you can say what caused it.
7. Be aware that the reason for the exception might be at a totally different location, e.g. the code dereferenced a NULL pointer, but the actual reason why the pointer is NULL can be somewhere else.
Steps 6 and 7 go beyond the line of code you asked for, but I recommend being aware of them.
I hope you can get a similar environment with the GCC compiler for your platform. If you don't have a usable MAP file, use the toolchain's tools to get the addresses of the functions. I am sure the ELF file format supports this.
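With the GNU toolchain, addr2line from binutils can stand in for steps 2 to 5, assuming you kept a binary with symbols; the address below is illustrative:
$ addr2line -e ./your_program -f -C 0x0040abcd
-f prints the enclosing function name and -C demangles C++ names.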

In case any of you (like me!) were looking for this same question but with gfortran, not gcc: the compiler is much more powerful these days, and before resorting to the debugger you can also try out these compile options. For me, this identified exactly the line of code where the error occurred, and which variable I was accessing out of bounds to cause the segmentation fault.
-O0 -g -Wall -fcheck=all -fbacktrace

Related

Function call missing from C stack trace

I'm importing stack-tracing C code (found somewhere on Stack Overflow) into my code to trace where memory blocks have been allocated:
struct layout
{
    struct layout *ebp;   /* saved frame pointer of the caller */
    void *ret;            /* return address */
};
struct layout *fr;
/* load the current frame pointer into fr */
__asm__("movl %%ebp, %[fp]" : /* output */ [fp] "=r" (fr));
for (int i = 1; i < 8 && (unsigned char*) fr > dsRAM; i++) {
    x[i] = (size_t) fr->ret;   /* record the return address */
    fr = fr->ebp;              /* walk down to the caller's frame */
}
Things work fairly well, except that in some calls, the code is missing some functions near the top of the stack, e.g. GDB will report:
malloc() at main.cpp
operator new() from libstdc++.so.6
TestBasicScript() at BasicScript.cpp
main() at main.cpp
Meanwhile, the code fills x[] with the addresses of malloc, operator new and main(), missing TestBasicScript.
The code got compiled by g++ 4.5.1 (old devkit for homebrew console programming) with the following flags:
CFLAGS += -I libgeds/source/ -I wrappers -I $(DEVKITPRO)/include -DARM9 \
-include wrappers/nds/system.h -include wrappers/fake.h
CFLAGS += -m32 -Duint=uint32_t -g -Wall -Weffc++ -fno-omit-frame-pointer
I tried to use __builtin_return_address() instead, but I get pretty much the same result with much longer code.
EDIT: I noted that I'm systematically missing the caller of operator new, which could be explained if the code of _Znwj doesn't set up a stack frame. So the list of questions becomes:
How does GDB manage to find that TestBasicScript() function call if it's not in the stack frames list?
How do I configure the linking steps so that a debug-friendly variant of libstdc++ (if any) is used?
The original sub-question "Is there a compile-time option that guarantees I can trace 100% of the calls to my malloc clone?" is thus answered by @chqrlie: -O0 is all I should need. But it will only be effective if applied to all my binaries, shared libraries included.
There are many reasons why some frames might be omitted, like for example inlining and optimization (although the provided CFLAGS do not contain optimization flags and the default is AFAIK no optimization).
Anyway, for GCC there is builtin support for stack walking by using backtrace() and backtrace_symbols(), perhaps combined with abi::__cxa_demangle(); you can try those as well.
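A minimal sketch of that approach (glibc API; link with -rdynamic so that backtrace_symbols() can resolve function names):
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

void print_trace(void) {
    void *addrs[16];
    int n = backtrace(addrs, 16);              /* capture up to 16 frames */
    char **syms = backtrace_symbols(addrs, n); /* resolve them to strings */
    if (syms) {
        for (int i = 0; i < n; i++)
            printf("%s\n", syms[i]);
        free(syms);  /* a single free() releases the whole array */
    }
}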
Another option is to use libunwind; I tried it as well, with quite good results (and in its source code you can see some useful techniques for in-app stack walking).
All of the above usually don't work very well with optimized (release) executables. In particular, if they do not contain the debug info (although it might have been generated and stored aside), the printed stack will be useless, besides the frames skipped because of the optimization.
An ultimate technique which works even for optimized code is generating a core dump. There you have all the information about the stack (the binary itself does not need to contain the debug info; it can just be kept aside and only used for examining the core offline), and as a bonus the values of all variables on the stack, information about all threads currently running, etc.
For tracing memory allocations it is probably overkill (it is also quite slow), but sometimes it can be pretty useful. In one of my projects I created a working implementation of such a core dumper which is still present in the production code.
Note that you can actually generate a core dump of the app without terminating the application - the implementation I created basically works as follows:
fork() the process at the point where the core dump should be generated
the child process calls abort() to generate the core dump (the call stack of the forked process is the same as the original process), i.e. only the forked process is terminated by the abort()
the original parent process uses waitpid() to wait until the child process generates the core dump and terminates (with a guard counter to not wait forever)
then the original process continues running (and writes to the log that the diagnostic core has been generated along with the PID of the forked process which was used to generate the core)
This turned out to work pretty well in some situations where a diagnostic stack trace was required for release production application.
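A hedged sketch of that scheme (the function name is mine; error handling and the guard counter are omitted for brevity):
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

void dump_core_and_continue(void) {
    pid_t pid = fork();
    if (pid == 0) {
        /* child: its stack is a copy of the parent's at the fork() point */
        signal(SIGABRT, SIG_DFL);  /* make sure abort() takes the default action */
        abort();                   /* terminates only the child, writing the core */
    } else if (pid > 0) {
        int status;
        waitpid(pid, &status, 0);  /* wait until the core has been written */
        /* parent continues running; log pid to identify the core file */
    }
}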
EDIT: Another option which I also tried is using ptrace() (if I remember well, that is also one of the techniques used by the libunwind mentioned above, and actually also by GDB). It works in a similar way: spawning a child process with fork() and then calling ptrace(PTRACE_TRACEME) in there; the parent process can then issue various ptrace() calls to examine the stack of the child (which happens to be the same as the stack of the parent at the point of the fork()). I think the libunwind source code contains uses of it, so you can examine them there.
The compiler may not always generate a stack frame with %ebp pointing to the previous frame. For some functions it may generate code that uses %esp-based addressing to retrieve the arguments; for others it may generate tail calls with a jump instead of a call/ret sequence. The stack trace as you are trying to scan it may be incomplete.
Try compiling the whole project with optimisation disabled (-O0).

how to make GCC print helpful RUNTIME error messages?

Defining _GLIBCXX_DEBUG forces GCC to catch a large class of runtime errors in C++, such as out-of-bounds STL access, invalid iterators, etc.
Unfortunately, when the error happens, the message printed is not very helpful. I know how to print a backtrace with a function, and __FILE__ and __LINE__ with a macro myself.
Is there an easy way to convince GCC to do that, or to specify a function/macro for it to call when the kind of errors that _GLIBCXX_DEBUG catches actually occur?
I assume you mean you want messages that print the context of use in your code, rather than the filename and line number of some internal header file used by GCC.
There appears to be a single macro in .../debug/macros.h that all the checking code uses called _GLIBCXX_DEBUG_VERIFY. You could modify it to suit your needs.
Edit: Jonathan Wakely points out that all checks are fatal.
When a Debug Mode check fails it calls abort(), so it dumps a core file which you can easily examine with a debugger to see where it failed. If you run the program in a debugger it will stop when it aborts and you can print the stack trace with backtrace.
To make that automatic you would need to change the call to abort() (in libstdc++-v3/src/c++11/debug.cc). I think you could change it to call std::terminate() and then install your own terminate_handler with set_terminate to make it print a backtrace.
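If you made that change, the handler itself might look like this (a sketch assuming glibc's <execinfo.h>; link with -rdynamic so the frames get readable names):
#include <cstdlib>
#include <exception>
#include <execinfo.h>
#include <unistd.h>

void backtrace_on_terminate() {
    void *addrs[32];
    int n = backtrace(addrs, 32);
    backtrace_symbols_fd(addrs, n, STDERR_FILENO);  // writes directly to stderr
    std::_Exit(EXIT_FAILURE);                       // a terminate handler must not return
}

int main() {
    std::set_terminate(backtrace_on_terminate);
    // ... code that may trip a _GLIBCXX_DEBUG check ...
}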

How to interpret a GDB backtrace?

#0  0x004069f1 in Space::setPosition (this=0x77733cee, x=-65, y=-49) at space.h:44
#1  0x00402679 in Checkers::make_move (this=0x28cbb8, move=...) at checkers.cc:351
#2  0x00403fd2 in main_savitch_14::game::make_computer_move (this=0x28cbb8) at game.cc:153
#3  0x00403b70 in main_savitch_14::game::play (this=0x28cbb8) at game.cc:33
#4  0x004015fb in _fu0___ZSt4cout () at checkers.cc:96
#5  0x004042a7 in main () at main.cc:34
Hello, I am coding a game for a class and I am running into a segfault. The checker pieces are held in a two-dimensional array, so the offending bit appears to be invalid x/y values for the array. The moves are passed as strings, which are converted to integers; the x and y were somehow ASCII NUL. I noticed that in the frame for make_move it says move=...
Why does it say move=...? Also, any other quick tips for solving a segfault? I am kind of new to GDB.
Basically, the backtrace is trace of the calls that lead to the crash. In this case:
game::play called game::make_computer_move which called Checkers::make_move which called Space::setPosition which crashed in line 44 in file space.h.
Taking a look at this backtrace, it looks like you passed -65 and -49 to Space::setPosition, and those sure look like invalid coordinates to me (being negative and all). You should look in the calling functions to see why they have the values that they do and correct them.
I would suggest using assert liberally in the code to enforce contracts; pretty much any time you can say "this parameter or variable should only have values which meet certain criteria", you should assert that this is the case.
A common example is a function which takes a pointer (or more likely a smart pointer) which is not allowed to be NULL. I'll have the first line of the function be assert(p);. If a NULL pointer is ever passed, I know right away and can investigate.
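For example (Widget and draw() are made-up names):
#include <cassert>

struct Widget;  // stand-in for one of your own types

void draw(const Widget *p) {
    assert(p && "draw() requires a non-null pointer");  // fails fast on NULL
    // ... use *p ...
}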
Finally, run the application in gdb and, when it crashes, type up to inspect the calling stack frame and see what the variables looked like (you can usually write things like print x in the console). Likewise, down will move down the call stack if you need to.
As for the segfault itself, I would recommend running the application in valgrind. If you compile with debugging information (-g), it can often tell you the line of code that is causing the error (and can even catch errors that, for unfortunate reasons, don't crash right away).
I am not allowed to comment, but just wanted to reply for anyone looking at this issue more recently and trying to find where the variables become (-65, -49). If you are getting a segfault you can get a core dump. Here is a pretty good source for making sure you can set up gdb to get a core dump. Then you can open your core file with gdb:
gdb -c myCoreFile
Then set a breakpoint on your function call you'd like to step into:
b MyClass::myFunctionCall
Then step through with next or step to maneuver through lines of code:
step
or
next
When you are at a place in your code that you'd like to evaluate a variable you can print it:
p myVariable
or you can print all arguments:
info args
I hope this helps someone else looking to debug!

Tracking down the source code line of a crash from a non-debug built module

I have a Windows crash dump with a call stack showing me the module!functionname+offset of the function that caused the crash. The module is built without debug information using gcc.
The cause of the crash is an exception caused by a failed write at a given address, i.e. access violation (05), write violation (01).
On my development machine I have access to the same module built with debugging information. What I'm looking for is a way to track down the corresponding source code line that caused the crash, this by using the module!functionname+offset information as starting point.
The method name of the top frame in the call stack is a class destructor
The mangled function name is _ZN20ViewErrorDescriptionD0Ev+0x79
Running objdump -d and searching for the module!functionname+offset gives:
.... call *%eax
.... mov 0xffffffbc(%ebp), %eax
.... cmpl 0x0, 0x148(%eax)
Trying to find this in the debug-built file gives no match.
The source code of the destructor only contains two delete pointerX calls.
Using gdb to load the debug-built module (sharedlibrary) and then calling info line gives me a start and an end address; using grep on the objdump output shows the corresponding disassembled code, which looks quite a lot like the one from the module without debug info, but is still far from identical.
NB: the output from info line says _ZN20ViewErrorDescriptionD2Ev, not _ZN20ViewErrorDescriptionD0Ev as the crash dump says.
Taken from the ABI documentation:
::= D0 # deleting destructor
::= D1 # complete object destructor
::= D2 # base object destructor
Where do I go from here?
Best regards
Kristofer H
Unfortunately, debug and non-debug builds may have different address layouts. The only way I'm aware of to accomplish something like this is to build with debug symbols and save off a copy of that binary. Then you can deploy a stripped version without the debug information.
Your approach of attempting to locate the assembly code seems the most hopeful here. I would expand on that, though: try to look at a much larger chunk of assembly in the crashed file and see if you can establish more context yourself, rather than having the computer attempt to match low-level instructions that might in fact slightly differ.
This works on the assumption that gcc compilation is 100% deterministic; I'm not sure how valid that assumption is. However, further assuming that you still have exactly the same source code, you could try enabling gcc's -S command line option and rebuilding. This will result in a set of .s files, one for each source file, containing the assembly code. You can then search through these for the machine code that you want to find.
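For example (the -fverbose-asm flag is my addition; it annotates the listing with source-level names, and you should substitute the exact optimization flags of the original build, or the comparison is meaningless):
$ g++ -S -fverbose-asm -O2 foo.cpp -o foo.s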

How to debug a segmentation fault while the gdb stack trace is full of '??'?

My executable contains a symbol table. But it seems that the stack trace has been overwritten.
How do I get more information out of that core? For instance, is there a way to inspect the heap and see the object instances populating it, to get some clues? Any idea is appreciated.
I am a C++ programmer for a living and I have encountered this issue more times than I like to admit. Your application is smashing a HUGE part of the stack. Chances are the function that is corrupting the stack is also crashing on return. The reason is that the return address has been overwritten, and this is why GDB's stack trace is messed up.
This is how I debug this issue:
1) Step through the application until it crashes. (Look for a function that is crashing on return.)
2) Once you have identified the function, declare a variable at the VERY FIRST LINE of the function:
int canary = 0;
(The reason it must be the first line is that this value must be at the very top of the stack. This "canary" will be overwritten before the function's return address.)
3) Put a variable watch on canary, step through the function, and when canary != 0 you have found your buffer overflow! Another possibility is to set a variable breakpoint for when canary != 0 and just run the program normally; this is a little easier, but not all IDEs support variable breakpoints.
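A sketch of the idea (the function and buffer size are made up; optimizing compilers may reorder locals, so this is most reliable at -O0):
#include <string.h>

void suspect_function(const char *input) {
    int canary = 0;      /* very first local: sits near the return address */
    char buf[16];
    strcpy(buf, input);  /* the suspected overflow */
    /* watch 'canary' in the debugger: the moment it becomes nonzero,
       the statement that smashed the stack has just executed */
    (void)canary;
}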
EDIT: I have talked to a senior programmer at my office, and in order to understand the core dump you need to resolve the memory addresses it contains. One way to figure out these addresses is to look at the MAP file for the binary, which is human-readable. Here is an example of generating a MAP file using gcc:
gcc -o foo -Wl,-Map,foo.map foo.c
This is a piece of the puzzle, but it will still be very difficult to obtain the address of the function that is crashing. If you are running this application on a modern platform then ASLR will probably make the addresses in the core dump useless. Some implementations of ASLR randomize the function addresses of your binary, which makes the core dump absolutely worthless.
1. You have to use a debugger to detect the error; valgrind is OK.
2. While you are compiling your code, add the -Wall option; it makes the compiler tell you if there are mistakes (make sure you don't have any warnings in your code).
ex: gcc -Wall -g -c -o oke.o oke.c
3. Make sure you also have the -g option to produce debugging information. You can print debugging information using some macros. The following macros are very useful for me:
__LINE__ : tells you the line
__FILE__ : tells you the source file
__func__ : tells you the function
Using the debugger alone is not enough, I think; you should get used to making the most of the compiler's abilities.
Hope this helps.
TL;DR: extremely large local variable declarations in functions are allocated on the stack, which, on certain platform and compiler combinations, can overrun and corrupt the stack.
Just to add another potential cause to this issue. I was recently debugging a very similar issue. Running gdb with the application and core file would produce results such as:
Core was generated by `myExecutable myArguments'.
Program terminated with signal 6, Aborted.
#0 0x00002b075174ba45 in ?? ()
(gdb)
That was extremely unhelpful and disappointing. After hours of scouring the internet, I found a forum that talked about how the particular compiler we were using (Intel compiler) had a smaller default stack size than other compilers, and that large local variables could overrun and corrupt the stack. Looking at our code, I found the culprit:
void MyClass::MyMethod() {
    ...
    char charBuffer[MAX_BUFFER_SIZE];
    ...
}
Bingo! I found that MAX_BUFFER_SIZE was set to 10000000, so a 10 MB local variable was being allocated on the stack! After changing the implementation to use a shared_ptr and create the buffer dynamically, the program suddenly started working perfectly.
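The fix looked roughly like this (a sketch based on the snippet above; a shared_ptr with an array deleter requires C++11):
#include <memory>

void MyClass::MyMethod() {
    // the ~10 MB buffer now lives on the heap instead of the stack
    std::shared_ptr<char> charBuffer(new char[MAX_BUFFER_SIZE],
                                     std::default_delete<char[]>());
    // ... use charBuffer.get() ...
}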
Try running with Valgrind memory debugger.
To confirm: was your executable compiled in release mode, i.e. without debug symbols? That could explain why there are ??s. Try recompiling with the -g switch, which includes debugging information and embeds it into the executable. Other than that, I am out of ideas as to why you have '??'.
Not really. Sure you can dig around in memory and look at things. But without a stack trace you don't know how you got to where you are or what the parameter values were.
However, the very fact that your stack is corrupt tells you that you need to look for code that writes into the stack.
Overwriting a stack array. This can be done the obvious way or by calling a function or system call with bad size arguments or pointers of the wrong type.
Using a pointer or reference to a function's local stack variables after that function has returned.
Casting a pointer to a stack value to a pointer of the wrong size and using it.
If you have a Unix system, "valgrind" is a good tool for finding some of these problems.
I assume that, since you say "my executable contains a symbol table", you compiled and linked with -g and your binary wasn't stripped.
We can just confirm this:
strings -a ./your_program | grep function_name_you_know_should_exist
Also try using pstack on the core and see if it does a better job of picking up the call stack. If it does, it sounds like your gdb is out of date compared to your gcc/g++ version.
Sounds like you're not using the identical glibc version on your machine as the one in use on production when the core file was generated. Get the files output by "ldd ./appname" and load them onto your machine, then tell gdb where to look:
set solib-absolute-prefix /path/to/libs