How to interpret a GDB backtrace? - c++

0x004069f1 in Space::setPosition (this=0x77733cee, x=-65, y=-49) at space.h:44
0x00402679 in Checkers::make_move (this=0x28cbb8, move=...) at checkers.cc:351
0x00403fd2 in main_savitch_14::game::make_computer_move (this=0x28cbb8) at game.cc:153
0x00403b70 in main_savitch_14::game::play (this=0x28cbb8) at game.cc:33
0x004015fb in _fu0___ZSt4cout () at checkers.cc:96
0x004042a7 in main () at main.cc:34
Hello, I am coding a game for a class and I am running into a segfault. The checker pieces are held in a two dimensional array, so the offending bit appears to be invalid x/y for the array. The moves are passed as strings, which are converted to integers, thus for the x and y were somehow ASCII NULL. I noticed that in the function call make_move it says move=...
Why does it say move=...? Also, any other quick tips of solving a segfault? I am kind of new to GDB.

Basically, the backtrace is trace of the calls that lead to the crash. In this case:
game::play called game::make_computer_move which called Checkers::make_move which called Space::setPosition which crashed in line 44 in file space.h.
Taking a look at this backtrace, it looks like you passed -65 and -49 to Space::setPosition, which if they happen to be invalid coordinates (sure look suspicious to me being negative and all). Then you should look in the calling functions to see why they have the values that they do and correct them.
I would suggest using assert liberally in the code to enforce contracts, pretty much any time you can say "this parameter or variable should only have values which meet certain criteria", then you should assert that it is the case.
A common example is if I have a function which takes a pointer (or more likely smart pointer) which is not allowed to be NULL. I'll have the first line of the function assert(p);. If a NULL pointer is ever passed, I know right away and can investigate.
Finally, run the application in gdb, when it crashes. Type up to inspect the calling stack frame and see what the variables looked like: (you can usually write things like print x in the console). likewise, down will move down the call stack if you need to as well.
As for SEGFAULT, I would recommend runnning the application in valgrind. If you compile with debugging information -g, then it often can tell you the line of code that is causing the error (and can even catch errors that for unfortunate reasons don't crash right away).

I am not allowed to comment, but just wanted to reply for anyone looking more recently on the issue trying to find where the variables become (-65, -49). If you are getting a segfault you can get a core dump. Here is a pretty good source for making sure you can set up gdb to get a core dump. Then you can open your core file with gdb:
gdb -c myCoreFile
Then set a breakpoint on your function call you'd like to step into:
b MyClass::myFunctionCall
Then step through with next or step to maneuver through lines of code:
step
or
next
When you are at a place in your code that you'd like to evaluate a variable you can print it:
p myVariable
or you can print all arguments:
info args
I hope this helps someone else looking to debug!

Related

Dummy call makes a difference in Fortran program?

I'm working on a Fortran program and running into a strange bug with some Heisenbug-type characteristics, and looking for some insight into what might be going on. The code is too large to post in full but I hopefully I can the general idea.
What's basically going on is I have a subroutine that reads a list of numerical parameters from a text file,
call read_parameters(filename, parameter_array)
and then this list of parameters is sent into another subroutine that runs a program using those parameter values.
call run_program(parameter_array)
These calls are part of a loop that calls run_program with slightly different parameters each time through the loop---the intention is to find better parameter sets.
I've found that on the first pass through this loop, run_program gives bizarre results, which seems to indicate that something is going wrong with the first call to read_parameters. But all the subsequent passes behave normally and I haven't been able to understand what's going wrong with that first pass despite a lot of investigating, including for example printing the values of the parameters themselves within the actual run_program code.
While testing, I realized that if I put another call to read_parameters right above the call to run_program, then the first pass of the program runs normally, but here's the thing: this new call to read_parameters is just a dummy call, with an output array parameter_array2 that doesn't even get used! As in,
call read_parameters(parameter_array)
call read_parameters(parameter_array2)
call run_program(parameter_array)
If the second line is present, the program runs just fine, even though parameter_array2 isn't used anywhere, while if it's absent the program gives erroneous results for the first pass through the loop.
Does anyone have any ideas about what might be going on?
Thanks.

Fortran script only runs when print statement added

I am running an atmospheric model, and need to compile an executable to convert some files. If I compile the code as supplied, it runs but it gets stuck and doesn't ever complete. It doesn't give an error or anything like that.
After doing some testing by adding print statements to see where it was getting stuck, I've found that the executable only runs if I compile the code with a print statement in one of the subroutines.
The piece of code in question is the one here. Specifically, the code fails to run unless I put a print statement somewhere in the get_bottom_top_dim subroutine.
Does anyone know why this might be? It doesn't matter what the print statement is (currently I'm using print*, '!'). but as soon as I remove it or comment it out, the code no longer works.
I'm assuming it must have something to do with my machine or compiler (ifort 12.1.0), but I'm stumped as to what the problem is!
This is an extended comment rather than an answer:
The situation you describe, inserting a print statement which apparently fixes a program, often arises when the underlying problem is due to either
a) an attempt to access an element outside the declared bounds of an array; or
b) a mismatch between dummy and actual arguments to some procedure.
Recompile your program with the compiler options to check interfaces at compile-time and to check array bounds at run-time.
Fortran has evolved a LOT since I last used it but here's how to go about solving your problem.
Think of some hypotheses that could explain the symptoms, e.g. the compiler is optimizing the subroutine down to a no-op when it has no print side effect. Or a compiler bug is translating this code into something empty or an infinite loop or crashing code. (What exactly do you mean by "fails to run"?) Or the Linker is failing to link in some needed code unless the subroutine explicitly calls print.
Or there's a bug in this subroutine and the print statement alters its symptoms e.g. by changing which data gets overwritten by an index-out-of-bounds bug.
Think of ways to test these hypotheses. You might already have observations adequate to rule out of some of them. You could decompile the object code to see if this subroutine is empty. Or step through it in a debugger. Or replace the print statement with a different side effect like logging to a file or to an in-memory text buffer.
Or turn on all optional runtime memory checks and compile time warnings. Or simplify the code until the problem goes away, then binary search on bringing back code until the problem recurs.
Do the most likely or easiest tests first. Rule out some hypotheses, and iterate.
I had a similar bug and I found that the problem was in the dependencies on the makefile.
This was what I had:
I set a variable with a value and the program stops.
I write a print command and it works.
I delete the print statement and continues to work.
I alter the variable value and stops.
The thing is, the variable value is set in a parameters.f90
The print statement is in a file H3.f90 that depends on parameters.f90 but it was not declared on the makefile.
After correcting:
h3.o: h3.f90 variables.f90 parameters.f90
$(FC) -c h3.f90
It all worked properly.

Setting a Breakpoint in GDB

I have a function that returns a pointer:
static void *find_fit(size_t asize);
I would like to set a breakpoint in gdb, but when I type this function name, I get one of these errors:
break *find_fit
Function "*find_fit" not defined
or
break find_fit
Function "find_fit" not defined
I can easily set break point on a function that returns something other than a pointer, but when the function does return a pointer, gdb doesn't seem to want to break on it.
Anybody see what is going on? Thanks!
It sounds like for some reason, gdb isn't handling C++ name mangling correctly. Normally you don't have to touch anything for this to work. You can try show language. Typically it's set to auto. You can also try manually setting it with set language c++.
To test, you can just type
b 'find<tab>
(that's the tab character, not the characters "<tab>") and it should try to autocomplete the name of the function for you. In C++ you need the argument types to know the function, but that doesn't 100% fit with what you're seeing because if you give gdb a function name without arguments, it'll usually do the right thing or prompt you for which version of a function you want. You aren't seeing either of those.

Remove never-run call to templated function, get allocation error on run-time

I have a piece of templated code that is never run, but is compiled. When I remove it, another part of my program breaks.
First off, I'm a bit at a loss as to how to ask this question. So I'm going to try throwing lots of information at the problem.
Ok, so, I went to completely redesign my test project for my experimental core library thingy. I use a lot of template shenanigans in the library. When I removed the "user" code, the tests gave me a memory allocation error. After quite a bit of experimenting, I narrowed it down to this bit of code (out of a couple hundred lines):
void VOODOO(components::switchBoard &board) {
board.addComponent<using_allegro::keyInputs<'w'> >();
}
Fundementally, what's weirding me out is that it appears that the act of compiling this function (and the template function it then uses, and the template functions those then use...), makes this bug not appear. This code is not being run. Similar code (the same, but for different key vals) occurs elsewhere, but is within Boost TDD code.
I realize I certainly haven't given enough information for you to solve it for me; I tried, but it more-or-less spirals into most of the code base. I think I'm most looking for "here's what the problem could be", "here's where to look", etc. There's something that's happening during compile because of this line, but I don't know enough about that step to begin looking.
Sooo, how can a (presumably) compilied, but never actually run, bit of templated code, when removed, cause another part of code to fail?
Error:
Unhandled exceptionat 0x6fe731ea (msvcr90d.dll) in Switchboard.exe:
0xC0000005: Access violation reading location 0xcdcdcdc1.
Callstack:
operator delete(void * pUser Data)
allocator< class name related to key inputs callbacks >::deallocate
vector< same class >::_Insert_n(...)
vector< " " >::insert(...)
vector<" ">::push_back(...)
It looks like maybe the vector isn't valid, because _MyFirst and similar data members are showing values of 0xcdcdcdcd in the debugger. But the vector is a member variable...
Update: The vector isn't valid because it's never made. I'm getting a channel ID value stomp, which is making me treat one type of channel as another.
Update:
Searching through with the debugger again, it appears that my method for giving each "channel" it's own, unique ID isn't giving me a unique ID:
inline static const char channel<template args>::idFunction() {
return reinterpret_cast<char>(&channel<CHANNEL_IDENTIFY>::idFunction);
};
Update2: These two are giving the same:
slaveChannel<switchboard, ALLEGRO_BITMAP*, entityInfo<ALLEGRO_BITMAP*>
slaveChannel<key<c>, char, push<char>
Sooo, having another compiled channel type changing things makes sense, because it shifts around the values of the idFunctions? But why are there two idFunctions with the same value?
you seem to be returning address of the function as a character? that looks weird. char has much smaller bit count than pointer, so it's highly possible you get same values. that could reason why changing code layout fixes/breaks your program
As a general answer (though aaa's comment alludes to this): When something like this affects whether a bug occurs, it's either because (a) you're wrong and it is being run, or (b) the way that the inclusion of that code happens to affect your code, data, and memory layout in the compiled program causes a heisenbug to change from visible to hidden.
The latter generally occurs when something involves undefined behavior. Sometimes a bogus pointer value will cause you to stomp on a bit of your code (which might or might not be important depending on the code layout), or sometimes a bogus write will stomp on a value in your data stack that might or might not be a pointer that's used later, or so forth.
As a simple example, supposing you have a stack that looks like:
float data[10];
int never_used;
int *important pointer;
And then you erroneously write
data[10] = 0;
Then, assuming that stack got allocated in linear order, you'll stomp on never_used, and the bug will be harmless. However, if you remove never_used (or change something so the compiler knows it can remove it for you -- maybe you remove a never-called function call that would use it), then it will stomp on important_pointer instead, and you'll now get a segfault when you dereference it.

How to debug a segmentation fault while the gdb stack trace is full of '??'?

My executable contains symbol table. But it seems that the stack trace is overwrited.
How to get more information out of that core please? For instance, is there a way to inspect the heap ? See the objects instances populating the heap to get some clues. Whatever, any idea is appreciated.
I am a C++ programmer for a living and I have encountered this issue more times than i like to admit. Your application is smashing HUGE part of the stack. Chances are the function that is corrupting the stack is also crashing on return. The reason why is because the return address has been overwritten, and this is why GDB's stack trace is messed up.
This is how I debug this issue:
1)Step though the application until it crashes. (Look for a function that is crashing on return).
2)Once you have identified the function, declare a variable at the VERY FIRST LINE of the function:
int canary=0;
(The reason why it must be the first line is that this value must be at the very top of the stack. This "canary" will be overwritten before the function's return address.)
3) Put a variable watch on canary, step though the function and when canary!=0, then you have found your buffer overflow! Another possibility it to put a variable breakpoint for when canary!=0 and just run the program normally, this is a little easier but not all IDE's support variable breakpoints.
EDIT: I have talked to a senior programmer at my office and in order to understand the core dump you need to resolve the memory addresses it has. One way to figure out these addresses is to look at the MAP file for the binary, which is human readable. Here is an example of generating a MAP file using gcc:
gcc -o foo -Wl,-Map,foo.map foo.c
This is a piece of the puzzle, but it will still be very difficult to obtain the address of function that is crashing. If you are running this application on a modern platform then ASLR will probably make the addresses in the core dump useless. Some implementation of ASLR will randomize the function addresses of your binary which makes the core dump absolutely worthless.
You have to use some debugger to detect, valgrind is ok
while you are compiling your code make sure you add -Wall option, it makes compiler will tell you if there are some mistakes or not (make sure you done have any warning in your code).
ex: gcc -Wall -g -c -o oke.o oke.c
3. Make sure you also have -g option to produce debugging information. You can call debugging information using some macros. The following macros are very useful for me:
__LINE__ : tells you the line
__FILE__ : tells you the source file
__func__ : tells yout the function
Using the debugger is not enough I think, you should get used to to maximize compiler ablity.
Hope this would help
TL;DR: extremely large local variable declarations in functions are allocated on the stack, which, on certain platform and compiler combinations, can overrun and corrupt the stack.
Just to add another potential cause to this issue. I was recently debugging a very similar issue. Running gdb with the application and core file would produce results such as:
Core was generated by `myExecutable myArguments'.
Program terminated with signal 6, Aborted.
#0 0x00002b075174ba45 in ?? ()
(gdb)
That was extremely unhelpful and disappointing. After hours of scouring the internet, I found a forum that talked about how the particular compiler we were using (Intel compiler) had a smaller default stack size than other compilers, and that large local variables could overrun and corrupt the stack. Looking at our code, I found the culprit:
void MyClass::MyMethod {
...
char charBuffer[MAX_BUFFER_SIZE];
...
}
Bingo! I found MAX_BUFFER_SIZE was set to 10000000, thus a 10MB local variable was being allocated on the stack! After changing the implementation to use a shared_ptr and create the buffer dynamically, suddenly the program started working perfectly.
Try running with Valgrind memory debugger.
To confirm, was your executable compiled in release mode, i.e. no debug symbols....that could explain why there's ?? Try recompiling with -g switch which 'includes debugging information and embedding it into the executable'..Other than that, I am out of ideas as to why you have '??'...
Not really. Sure you can dig around in memory and look at things. But without a stack trace you don't know how you got to where you are or what the parameter values were.
However, the very fact that your stack is corrupt tells you that you need to look for code that writes into the stack.
Overwriting a stack array. This can be done the obvious way or by calling a function or system call with bad size arguments or pointers of the wrong type.
Using a pointer or reference to a function's local stack variables after that function has returned.
Casting a pointer to a stack value to a pointer of the wrong size and using it.
If you have a Unix system, "valgrind" is a good tool for finding some of these problems.
I assume that since you say "My executable contains symbol table" that you compiled and linked with -g, and that your binary wasn't stripped.
We can just confirm this:
strings -a |grep function_name_you_know_should_exist
Also try using pstack on the core ans see if it does a better job of picking up the callstack. In that case it sounds like your gdb is out of date compared to your gcc/g++ version.
Sounds like you're not using the identical glibc version on your machine as the corefile was when it crashed on production. Get the files output by "ldd ./appname" and load them onto your machine, then tell gdb where to look;
set solib-absolute-prefix /path/to/libs