MPI and Valgrind not showing line numbers - c++

I've written a large program and I'm having a really hard time tracking down a segmentation fault. I posted a question but I didn't have enough information to go on (see link below - and if you do, note that I spent almost an entire day trying several times to come up with a minimally compilable version of the code that reproduced the error to no avail).
https://stackoverflow.com/questions/16025411/phantom-bug-involving-stdvectors-mpi-c
So now I'm trying my hand at valgrind for the first time. I just installed it (simply "sudo apt-get install valgrind") with no special installation to account for MPI (if there is any). I'm hoping for concrete information including file names and line numbers (I understand it's impossible for valgrind to provide variable names). While I am getting useful information, including
Invalid read of size 4
Conditional jump or move depends on uninitialised value(s)
Uninitialised value was created by a stack allocation
4 bytes in 1 blocks are definitely lost
in addition to this magical thing
Syscall param sched_setaffinity(mask) points to unaddressable byte(s) at 0x433CE77: syscall (syscall.S:31) Address 0x0 is not stack'd, malloc'd or (recently) free'd
I am not getting file names and line numbers. Instead, I get
==15095== by 0x406909A: ??? (in /usr/lib/openmpi/lib/libopen-rte.so.0.0.0)
Here's how I compile my code:
mpic++ -Wall -Wextra -g -O0 -o Hybrid.out (…file names)
Here are two ways I've executed valgrind:
valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log.txt mpirun -np 1 Hybrid.out
and
mpirun -np 1 valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log4.txt -v ./Hybrid.out
The second version is based on the instructions in
Segmentation faults occur when I run a parallel program with Open MPI
which, if I'm understanding the chosen answer correctly, appears to be contradicted by
openmpi with valgrind (can I compile with MPI in Ubuntu distro?)
I am deliberately running valgrind on one processor because that's the only way my program will execute to completion without the segmentation fault. I have also run it with two processors, and my program seg faulted as expected, but the log I got back from valgrind seemed to contain essentially the same information. I'm hoping that by resolving the issues valgrind reports on one processor, I'll magically solve the issue happening on more than one.
I tried to include "-static" in the program compilation as suggested in
Valgrind not showing line numbers in spite of -g flag (on Ubuntu 11.10/VirtualBox)
but the compilation failed, saying (in addition to several warnings)
dynamic STT_GNU_IFUNC symbol "strcmp" with pointer equality in '…' can not be used when making an executable; recompile with fPIE and relink with -pie
I have not looked into what "fPIE" and "-pie" mean. Also, please note that I am not using a makefile, nor do I currently know how to write one.
A few more notes: My code does not use the commands malloc, calloc, or new. I'm working entirely with std::vector; no C arrays. I do use commands like .resize(), .insert(), .erase(), and .pop_back(). My code also passes vectors to functions by reference and constant reference. As for parallel commands, I only use MPI_Barrier(), MPI_Bcast(), and MPI_Allgatherv().
How do I get valgrind to show the file names and line numbers for the errors it is reporting? Thank you all for your help!
EDIT
I continued working on it and a friend of mine pointed out that the reports without line numbers are all coming from MPI files, which I did not compile from source, and since I did not compile them, I can't use the -g option, and hence, don't see lines. So I tried valgrind again based on this command,
mpirun -np 1 valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log4.txt -v ./Hybrid.out
but now for two processors, which is
mpirun -np 2 valgrind --tool=memcheck --leak-check=full --track-origins=yes --log-file=log4.txt -v ./Hybrid.out
The program ran to completion (I did not see the seg fault reported on the command line), but this execution of valgrind did give me line numbers within my files. The line valgrind is pointing to is a line where I call MPI_Bcast(). Is it safe to say that this appeared because the memory problem only manifests itself on multiple processors (since I've run it successfully with -np 1)?

It sounds like you are using the wrong tool. If you want to know where a segmentation fault occurs use gdb.
Here's a simple example. This program will segfault at *b=5
// main.c
int
main(int argc, char** argv)
{
int* b = 0;
*b = 5;
return *b;
}
To see what happened, run it under gdb (the <---- annotations mark the lines you type):
svengali ~ % g++ -g -c main.c -o main.o # include debugging symbols in .o file
svengali ~ % g++ main.o -o a.out # executable is linked (no -g here)
svengali ~ % gdb a.out
GNU gdb (GDB) 7.4.1-debian
<SNIP>
Reading symbols from ~/a.out...done.
(gdb) run <--------------------------------------- RUNS THE PROGRAM
Starting program: ~/a.out
Program received signal SIGSEGV, Segmentation fault.
0x00000000004005a3 in main (argc=1, argv=0x7fffffffe2d8) at main.c:5
5 *b = 5;
(gdb) bt <--------------------------------------- PRINTS A BACKTRACE
#0 0x00000000004005a3 in main (argc=1, argv=0x7fffffffe2d8) at main.c:5
(gdb) print b <----------------------------------- EXAMINE THE CONTENTS OF 'b'
$2 = (int *) 0x0
(gdb)
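The question says the code uses only std::vector with resize(), insert(), erase(), and pop_back(). If the fault comes from an out-of-range index (an assumption; nothing in the question confirms it), bounds-checked access turns a silent invalid read into an exception that gdb can stop on. The sketch below is illustrative only; checked_get and stale_index_is_caught are hypothetical names, not part of the asker's program.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: read element i, throwing instead of reading out of bounds.
int checked_get(const std::vector<int>& v, std::size_t i) {
    return v.at(i);  // at() throws std::out_of_range when i >= v.size()
}

// Hypothetical scenario: an index computed before pop_back() is reused afterwards.
bool stale_index_is_caught() {
    std::vector<int> v{10, 20, 30};
    std::size_t last = v.size() - 1;  // index 2 is valid here
    v.pop_back();                     // size is now 2, so index 2 is stale
    try {
        checked_get(v, last);         // operator[] would read past the end here
    } catch (const std::out_of_range&) {
        return true;                  // the bad access was detected
    }
    return false;
}
```

With operator[] the same access is undefined behaviour and may or may not crash, which is one way a bug can appear to depend on the number of processes.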

Related

How to find the cause of this segmentation fault using gdb and core-dump file?(Limitation of GDB)

I know I can use a core dump file to figure out where a program goes wrong. However, there are some bugs where, even if you debug with the core file, you still don't know why the program went wrong. So what I want to convey is that the scope of the bugs that gdb and core files can help you to debug is limited. How limited is it?
For example, I write the following code : (libfoo.c)
#include <stdio.h>
#include <stdlib.h>
void foo(void);
int main()
{
puts("This is a mis-compiled runnable shared library");
return 0;
}
void foo()
{
puts("This is the shared function");
}
The following is the makefile : (Makefile)
.PHONY : all clean
all : libfoo.c
gcc -g -Wall -shared -fPIC -Wl,-soname,$(basename $^).so.1 -o $(basename $^).so.1.0.0 $^; \
#the correct compiling command should be :
#gcc -g -Wall -shared -fPIC -pie -Wl,--export-dynamic,-soname,$(basename $^).so.1 -o $(basename $^).so.1.0.0 $^;
sudo ldconfig $(CURDIR); #this will set up soname link \
ln -s $(basename $^).so.1.0.0 $(basename $^).so #this will set up linker name link;
clean :
-rm libfoo.s*; sudo ldconfig;#roll back
When I ran it with ./libfoo.so, I got a segmentation fault, and this was because I compiled the runnable shared library in the wrong way. But I wanted to know exactly what was causing the segmentation fault. So I ran gdb libfoo.so.1.0.0 corefile, then bt, and got the following:
[xhan@localhost Desktop]$ gdb ./libfoo.so core.8326
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/xiaohan/Desktop/libfoo.so.1.0.0...done.
warning: core file may not match specified executable file.
[New LWP 8326]
Core was generated by `./libfoo.so'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000000001 in ?? ()
(gdb) bt
#0 0x0000000000000001 in ?? ()
#1 0x00007ffd29cd13b4 in ?? ()
#2 0x0000000000000000 in ?? ()
(gdb) quit
But I still don't know what caused the segmentation fault. Debugging the core file gives me no clue that the cause of my segmentation fault is the wrong compile command.
Can anyone help me with debugging this? Or can anyone tell me the scope of the bugs that are impossible to debug even using gdb and a core file? Answers that respond to only one question will also be accepted.
Thanks!
IMPORTANT ASSUMPTIONS I AM HOLDING:
Some may ask why I want to make a shared library runnable. I do this because I want to compile a shared library that is similar to /lib64/ld-2.17.so.
Of course you can't rely on gdb telling you the cause of every bug you have made. For example, if you simply chmod +x nonexecutable and run it, then get a bug (usually this will not dump a core file), and try to debug it with gdb, that is somewhat "crazy". However, once an "executable" can be loaded and dumps a core file at runtime, you can use gdb to debug it and, furthermore, FIND CLUES about why the program goes wrong. However, with my problem in ./libfoo.so, I am totally lost.
the scope of the bugs that gdb and core files can help you to debug is limited.
Correct: there are several large classes of bugs for which core dump provides little help. The most common (in my experience) are:
Issues that happen at process startup (such as the example you showed).
GDB needs cooperation with the dynamic loader to tell GDB where various ELF images are mmaped in the process space.
When the crash happens in the dynamic loader itself, or before the dynamic loader had a chance to tell GDB where things are, you end up with a very confusing picture.
Various heap corruption bugs.
Usually you can tell that it's likely that heap corruption is the problem (e.g. any crash inside malloc or free is usually a sign of one), but that tells you very little about the root cause of the problem.
Fortunately, tools like Valgrind and Address Sanitizer can often point you straight at the problem.
Various stack overflow bugs.
GDB uses contents of current stack to tell you how you got to the function you are in (backtrace).
But if you overwrite stack memory with garbage, then the record of how you got to where you are is lost. And if you corrupt the stack and then call through a "garbage" function pointer, you can end up with a core dump from which you can't tell either where you are or how you got there.
Various "logical" bugs.
For example, suppose you have a tree data structure, and a recursive procedure to visit its nodes. If your tree is not a proper tree, and has a cycle in it, your visit procedure will run out of stack and crash.
But looking at the crash tells you nothing about where the tree ceased to be a tree and turned into a graph.
Data races.
You may be iterating over elements of std::vector and crash. Examining the vector shows you that it is no longer in valid state.
That often happens when some other thread modifies the vector (or any other data structure) from under you.
Again, the crash stack trace tells you very little where the bug actually is.
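For the "logical" bug case above (a tree that has quietly become a graph), a validation pass that runs before the recursive visit can report the problem at the point where it is still diagnosable. This is a minimal sketch under assumptions: Node and is_proper_tree are hypothetical names, and the structure is a simple binary tree with raw child pointers.

```cpp
#include <unordered_set>

// Hypothetical node type for illustration.
struct Node {
    Node* left = nullptr;
    Node* right = nullptr;
};

// Returns false as soon as some node is reached a second time, which means
// the structure contains a cycle or a shared child and is not a proper tree.
bool is_proper_tree(const Node* root, std::unordered_set<const Node*>& seen) {
    if (root == nullptr) return true;
    if (!seen.insert(root).second) return false;  // already visited
    return is_proper_tree(root->left, seen) && is_proper_tree(root->right, seen);
}

// Convenience overload that owns the visited set.
bool is_proper_tree(const Node* root) {
    std::unordered_set<const Node*> seen;
    return is_proper_tree(root, seen);
}
```

Because each node is inserted into the visited set at most once, the check itself terminates on cyclic input, unlike the unguarded visit that runs out of stack.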

gdb segmentation fault line number missing with c++11 option [duplicate]

Is there any gcc option I can set that will give me the line number of the segmentation fault?
I know I can:
Debug line by line
Put printfs in the code to narrow down.
Edits:
bt / where on gdb give No stack.
Helpful suggestion
I don't know of a gcc option, but you should be able to run the application with gdb and then when it crashes, type where to take a look at the stack when it exited, which should get you close.
$ gdb blah
(gdb) run
(gdb) where
Edit for completeness:
You should also make sure to build the application with debug flags on using the -g gcc option to include line numbers in the executable.
Another option is to use the bt (backtrace) command.
Here's a complete shell/gdb session
$ gcc -ggdb myproj.c
$ gdb a.out
gdb> run --some-option=foo --other-option=bar
(gdb will say your program hit a segfault)
gdb> bt
(gdb prints a stack trace)
gdb> q
[are you sure, your program is still running]? y
$ emacs myproj.c # heh, I know what the error is now...
Happy hacking :-)
You can get gcc to print you a stacktrace when your program gets a SEGV signal, similar to how Java and other friendlier languages handle null pointer exceptions. See my answer here for more details:
how to generate a stacktace when my C++ app crashes ( using gcc compiler )
The nice thing about this is you can just leave it in your code; you don't need to run things through gdb to get the nice debug output.
If you compile with -g and follow the instructions there, you can use a command-line tool like addr2line to get file/line information from the output.
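A minimal sketch of that handler approach, assuming a glibc system where <execinfo.h> provides backtrace() and backtrace_symbols_fd(). The names crash_handler and install_crash_handler are made up for illustration, and this is not necessarily what the linked answer does. Note that backtrace() is not formally async-signal-safe, so treat this as a debugging aid rather than production code.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc-specific)
#include <unistd.h>    // write, _exit, STDERR_FILENO
#include <csignal>

// Signal handler: dump the raw stack frames to stderr, then terminate.
extern "C" void crash_handler(int sig) {
    void* frames[64];
    int count = backtrace(frames, 64);
    const char msg[] = "Caught fatal signal, backtrace:\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;  // nothing sensible to do if the write fails
    backtrace_symbols_fd(frames, count, STDERR_FILENO);
    _exit(128 + sig);
}

// Call once at startup. Returns the number of frames captured by a warm-up
// call, made here so the unwinder is initialised before any crash happens.
int install_crash_handler() {
    void* warmup[4];
    int captured = backtrace(warmup, 4);
    std::signal(SIGSEGV, crash_handler);
    return captured;
}
```

The addresses printed by the handler are what you would then pass to addr2line together with the executable built with -g.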
Run it under valgrind.
You also need to build with debug flags on (-g).
You can also open the core dump with gdb (you need -g though).
If all the preceding suggestions to compile with debugging (-g) and run under a debugger (gdb, run, bt) are not working for you, then:
Elementary: Maybe you're not running under the debugger, you're just trying to analyze the postmortem core dump. (If you start a debug session, but don't run the program, or if it exits, then when you ask for a backtrace, gdb will say "No stack" -- because there's no running program at all. Don't forget to type "run".) If it segfaulted, don't forget to add the third argument (core) when you run gdb, otherwise you start in the same state, not attached to any particular process or memory image.
Difficult: If your program is/was really running but your gdb is saying "No stack", perhaps your stack pointer is badly smashed. In that case, you may have a buffer overflow problem somewhere, severe enough to mash your runtime state entirely. GCC 4.1 supports the ProPolice "Stack Smashing Protector", which is enabled with -fstack-protector-all. It can be added to GCC 3.x with a patch.
There is no method for GCC to provide this information, you'll have to rely on an external program like GDB.
GDB can give you the line where a crash occurred with the "bt" (short for "backtrace") command after the program has seg faulted. This will give you not only the line of the crash, but the whole stack of the program (so you can see what called the function where the crash happened).
The "No stack" problem seems to happen when the program exits successfully.
For the record, I had this problem because I had forgotten a return in my code, which made my program exit with failure code.

How should I go about debugging a SIGFPE in a large, unfamiliar software project?

I'm trying to get to the bottom of a bug in KDE 5.6. The locker screen breaks no matter how I lock it. Here's the relevant code: https://github.com/KDE/kscreenlocker/blob/master/abstractlocker.cpp#L51
When I run /usr/lib/kscreenlocker_greet --testing, I get an output of:
KCrash: Application 'kscreenlocker_greet' crashing...
Floating point exception (core dumped)
I'm trying to run it with gdb to try and pin the exact location of the bug, but I'm not sure where to set the breakpoints in order to isolate the bug. Should I be looking for calls to KCrash? Or perhaps a raise() call? Can I get gdb to print off the relevant line of code that causes SIGFPE?
Thanks for any advice you can offer.
but I'm not sure where to set the breakpoints in order to isolate the bug
You shouldn't need to set any breakpoints at all: when a process running under GDB encounters a fatal signal (such as SIGFPE), the OS notices that the process is being traced by the debugger, and notifies the debugger (instead of terminating the process). That in turn causes GDB to stop, and prompt you for additional commands. It is at that time that you can look around and understand what caused the crash.
Example:
cat -n t.c
1 #include <fenv.h>
2
3 int foo(double d) {
4 return 1/d;
5 }
6
7 int main()
8 {
9 feenableexcept(FE_DIVBYZERO);
10 return foo(0);
11 }
gcc -g t.c -lm
./a.out
Floating point exception
gdb -q ./a.out
(gdb) run
Starting program: /tmp/a.out
Program received signal SIGFPE, Arithmetic exception.
0x000000000040060e in foo (d=0) at t.c:4
4 return 1/d;
(gdb) bt
#0 0x000000000040060e in foo (d=0) at t.c:4
#1 0x0000000000400635 in main () at t.c:10
(gdb) q
Here, as you can see, GDB stops when SIGFPE is delivered, and allows you to look around and understand the crash.
In your case, you would want to first install debuginfo symbols for KDE, and then run
gdb --args /usr/lib/kscreenlocker_greet --testing
(gdb) run

First experiments with buffer overflow

I've started reading about buffer overflows and how attackers use them to execute custom code instead of the regular compiled code, and now I'm trying to reproduce some basic situations with a vulnerable function that copies data into a char array using the unsafe strcpy.
The point is that when I replace the return address with the address of an assembly instruction of a function defined in the program, it works fine, while when I inject code directly as bytes, it returns SEGMENTATION FAULT.
I'm using the Kali distribution x64 v3.18
I've disabled the address space layout randomization (ASLR):
echo 0 > /proc/sys/kernel/randomize_va_space
And disabled the stack protection code added by the compiler:
gcc -g -fno-stack-protector exbof.c -o exbof
Code:
#include <stdlib.h>
#include <string.h>
int main(int argc, char **argv){
char buffer[500] = {0};
strcpy(buffer, argv[1]);
return 0;
}
Usage:
./exbof `perl -e 'print "x90"x216; // nop sled
print CUSTOM_CODE; // my code
print "xff"x(500 - 216 - CODE_LENGTH); // fill empty space
print "xff"xOFFSET // distance between the last byte
// of buffer and the return address
printf("\\x%lx", BUFFER_ADDRESS + int(rand(26)) * 8);'`
Output:
Segmentation Fault
In GDB:
Program received signal SIGSEGV, Segmentation fault.
0x00007fffffffxyzt in ?? ()
I've used GDB to debug it, and the code writes the new address correctly on the stack.
I'm using a shellcode exec found online, but I've also tried to inject a piece of code as bytes from my own program, and when I checked with GDB the injected assembly code turned out to be valid and exactly the same as the original.
It seems to me that any address out of the .text memory segment doesn't work.
Suggestions?
Solution:
As suggested by @andars, it's necessary to set the flag that marks the stack as executable.
So, if you want to try this and start playing with buffer overflows, you have to:
disable the address space layout randomization (ASLR):
echo 0 > /proc/sys/kernel/randomize_va_space
disable the stack protection code added by the compiler:
gcc -g -fno-stack-protector your_program.c -o your_program
set up a flag in the program header to mark the stack as executable:
execstack -s your_program
or you can do it directly at assembly time or at link time:
gcc -g -fno-stack-protector -z execstack your_program.c -o your_program

64-bit program segmentation fault on HP-UX PA RISC

I am using three HP-UX PA-RISC machines for testing. My binary fails on one of them, whereas it works on the others. Note that even when the binary is executed with the version check (i.e. it just prints the version and exits without performing any other operation), it still gives a segmentation fault. What could be the probable reason for the segmentation fault? It is important to me to find the root cause of the failure on this one box. Since the program works on the other two HP-UX machines, does that mean it is an environment issue?
I tried copying the same piece of code (i.e. declare variables, print version and exit) into a test program and building it with the same compilation options, but that works. Here is the gdb output for the program.
$ gdb prg_us
Detected 64-bit executable.
Invoking /opt/langtools/bin/gdb64.
HP gdb 5.4.0 for PA-RISC 2.0 (wide), HP-UX 11.00
and target hppa2.0w-hp-hpux11.00.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 5.4.0 (based on GDB) is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
(gdb) b 5573
Breakpoint 1 at 0x4000000000259e04: file pmgreader.c, line 5573 from /tmp/test/prg_us.
(gdb) r -v
Starting program: /tmp/test/prg_us -v
Breakpoint 1, main (argc=2, argv=0x800003ffbfff05f8) at pmgreader.c:5573
5573 if (argc ==2 && strcmp (argv[1], "-v") == 0)
Current language: auto; currently c++
(gdb) n
5575 printf ("%s", VER);
(gdb) n
5576 exit(0);
(gdb) n
Program received signal SIGSEGV, Segmentation fault
si_code: 0 - SEGV_UNKNOWN - Unknown Error.
0x800003ffbfb9e130 in real_free+0x480 () from /lib/pa20_64/libc.2
(gdb)
What could be the probable cause? Why does it work on one machine and not on another?
Just a long shot - are you including both stdio.h and stdlib.h so the prototypes for printf() and exit() are known to the compiler?
Actually, after a bit more thought (and noticing that C++ is in the mix), you may have some static object initialization causing problems (possibly corrupting the heap?).
Unfortunately, it looks like valgrind is not supported on PA-RISC - is there some similar tool on PA-RISC you can run? If not, it might be worthwhile running valgrind on an x64 build of your program if it's not too difficult to set that up.
Michael Burr already hinted at the problem: it's a global object.
Notice that the crash is from a free function. That indicates a memory deallocation, and in turn a destructor. This makes sense given the context: global destructors run after exit(0). A stack trace will show more detail.
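The ordering described above can be seen in a small sketch. Resource, global_resource, and print_version_and_exit are hypothetical names standing in for whatever global object and "-v" code path the real program has; the version string is a placeholder too.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical type standing in for any class with a non-trivial destructor.
struct Resource {
    static int destroyed;  // counts destructor calls
    const char* name;
    explicit Resource(const char* n) : name(n) { std::printf("construct %s\n", name); }
    ~Resource() { ++destroyed; std::printf("destroy %s\n", name); }
};
int Resource::destroyed = 0;

// Object with static storage duration: constructed before main() runs and
// destroyed during exit(), i.e. after the last statement the debugger shows.
Resource global_resource("global");

// Mirrors the "-v" path from the question: print something, then exit(0).
[[noreturn]] void print_version_and_exit() {
    std::printf("version (placeholder)\n");
    std::exit(0);  // runs ~Resource() for global_resource before the process ends
}
```

If such a destructor frees memory that was corrupted or already released, the crash shows up inside free during exit(), which would match the real_free frame in the gdb output.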