gdb: how to learn which shared library loaded a shared library in question - c++

I need to get the list of shared libraries used by an app in runtime. Most of them can be listed by ldd, but some can be seen only with gdb -p <pid> and by running the gdb command info sharedlib.
It would really help, if I could learn in some way: for a chosen library (in the list, output by info sharedlib), which library (of the same list) had caused to load it. Is there any way to learn it in gdb or in some other way? Because sometimes, I see a loaded library in the list and cannot get why it is there and which (probably, previously loaded in memory) library loaded it.
UPDATE:
I attach a screen shot of gdb showing the info that I need. I used breakpoint on dlopen, as it was suggested in comments and in the answer. The command x/s $rdi prints out the first argument of dlopen, as by Linux System V ABI, i.e. it prints the name if the library, about which I want to learn who loaded it (libdebuginfod.so.1). I put it here just for those who are curious. In my case, it can be seen, that the libdebuginfod.so.1 was loaded by libdw.so.1 (as shown by bt command).

Is there any way to learn it in gdb or in some other way?
There are a few ways.
You can run the program with env LD_DEBUG=files /path/to/exe.
This will produce output similar to:
LD_DEBUG=files /bin/date
76042:
76042: file=libc.so.6 [0]; needed by /bin/date [0]
It is the needed by part that you most care about.
You could also use GDB and use set set stop-on-solib-events 1. This will produce output similar to:
Stopped due to shared library event:
Inferior loaded /lib/x86_64-linux-gnu/libc.so.6
At that point, you could execute where command and observe which dlopen() call caused the new library to be loaded.
You could also set a breakpoint on dlopen() and do the same.
The stop-on-solib-events may be better if your executable repeatedly dlopen()s the same library -- the library set will not change when you dlopen() the same library again, and you'll stop when you don't care to stop. Setting stop-on-solib-events avoids such unnecessary stops.

Related

Not able to load core dump fully in gdb [duplicate]

We get core files from running our software on a Customer's box. Unfortunately because we've always compiled with -O2 without debugging symbols this has lead to situations where we could not figure out why it was crashing, we've modified the builds so now they generate -g and -O2 together. We then advice the Customer to run a -g binary so it becomes easier to debug.
I have a few questions:
What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
Are there any good books for debugging on Linux, or Solaris? Something example oriented would be great. I am looking for real-life examples of figuring out why a routine crashed and how the author arrived at a solution. Something more on the intermediate to advanced level would be good, as I have been doing this for a while now. Some assembly would be good as well.
Here's an example of a crash that requires us to tell the Customer to get a -g ver. of the binary:
Program terminated with signal 11, Segmentation fault.
#0 0xffffe410 in __kernel_vsyscall ()
(gdb) where
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x00454ff1 in select () from /lib/libc.so.6
...
<omitted frames>
Ideally I'd like to solve find out why exactly the app crashed - I suspect it's memory corruption but I am not 100% sure.
Remote debugging is strictly not allowed.
Thanks
What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
It the executable is dynamically linked, as yours is, the stack GDB produces will (most likely) not be meaningful.
The reason: GDB knows that your executable crashed by calling something in libc.so.6 at address 0x00454ff1, but it doesn't know what code was at that address. So it looks into your copy of libc.so.6 and discovers that this is in select, so it prints that.
But the chances that 0x00454ff1 is also in select in your customers copy of libc.so.6 are quite small. Most likely the customer had some other procedure at that address, perhaps abort.
You can use disas select, and observe that 0x00454ff1 is either in the middle of instruction, or that the previous instruction is not a CALL. If either of these holds, your stack trace is meaningless.
You can however help yourself: you just need to get a copy of all libraries that are listed in (gdb) info shared from the customer system. Have the customer tar them up with e.g.
cd /
tar cvzf to-you.tar.gz lib/libc.so.6 lib/ld-linux.so.2 ...
Then, on your system:
mkdir /tmp/from-customer
tar xzf to-you.tar.gz -C /tmp/from-customer
gdb /path/to/binary
(gdb) set solib-absolute-prefix /tmp/from-customer
(gdb) core core # Note: very important to set solib-... before loading core
(gdb) where # Get meaningful stack trace!
We then advice the Customer to run a -g binary so it becomes easier to debug.
A much better approach is:
build with -g -O2 -o myexe.dbg
strip -g myexe.dbg -o myexe
distribute myexe to customers
when a customer gets a core, use myexe.dbg to debug it
You'll have full symbolic info (file/line, local variables), without having to ship a special binary to the customer, and without revealing too many details about your sources.
You can indeed get useful information from a crash dump, even one from an optimized compile (although it's what is called, technically, "a major pain in the ass.") a -g compile is indeed better, and yes, you can do so even when the machine on which the dump happened is another distribution. Basically, with one caveat, all the important information is contained in the executable and ends up in the dump.
When you match the core file with the executable, the debugger will be able to tell you where the crash occurred and show you the stack. That in itself should help a lot. You should also find out as much as you can about the situation in which it happens -- can they reproduce it reliably? If so, can you reproduce it?
Now, here's the caveat: the place where the notion of "everything is there" breaks down is with shared object files, .so files. If it is failing because of a problem with those, you won't have the symbol tables you need; you may only be able to see what library .so it happens in.
There are a number of books about debugging, but I can't think of one I'd recommend.
As far as I remember, you dont need to ask your customer to run with the binary built with -g option. What is needed is that you should have a build with -g option. With that you can load the core file and it will show the whole stack trace. I remember few weeks ago, I created core files, with build (-g) and without -g and the size of core was same.
Inspect the values of local variables you see when you walk the stack ? Especially around the select() call. Do this on customer's box, just load the dump and walk the stack...
Also , check the value of FD_SETSIZE on both your DEV and PROD platforms !
Copying the resolution from my question which was considered a duplicate of this.
set solib-absolute-prefix from the accepted solution did not help for me. set sysroot was absolutely necessary to make gdb load locally provided libs.
Here is the list of commands I used to open core dump:
# note: all the .so files obtained from user machine must be put into local directory.
#
# most importantly, the following files are necessary:
# 1. libthread_db.so.1 and libpthread.so.0: required for thread debugging.
# 2. other .so files are required if they occur in call stack.
#
# these files must also be renamed exactly as the symlinks
# i.e. libpthread-2.28.so should be renamed to libpthread.so.0
# load executable file
file ./thedarkmod.x64
# force gdb to forget about local system!
# load all .so files using local directory as root
set sysroot .
# drop dump-recorded paths to .so files
# i.e. load ./libpthread.so.0 instead of ./lib/x86_64-linux-gnu/libpthread.so.0
set solib-search-path .
# disable damn security protection
set auto-load safe-path /
# load core dump file
core core.6487
# print stacktrace
bt

Why is gdb refusing to load my shared objects and what is the validation operation

Main question:
In Ubuntu trying to debug an embedded application running in QNX, I am getting the following error message from gdb:
warning: Shared object "$SOLIB_PATH/libc.so.4" could not be validated and will be ignored.,
Q: What is the "validation" operation going on ?
After some research I found that the information reported by readelf -n libfoo.so contains a build-id and that this is compared against something and there could be a mismatch causing gdb to refuse to load the library. If that's the case what ELF file's build-id is the shared object's build-id compared against ? Can I find this information parsing the executable file ?
More context:
I have a .core file for this executable. I am using a version of gdb provided by QNX and making sure I use set sysroot and set solib-search-path to where I installed the QNX toolchain.
My full command to launch gdb in Ubuntu is :
$QNX_TOOLCHAIN_PATH/ntox86_64-gdb --init-eval-command 'set sysroot $SYSROOT_PATH' --init-eval-command 'set solib-search-path $SOLIB_PATH --init-eval-command 'python sys.path.append("/usr/share/gcc-8/python");' -c path-to-exe.core path-to-executable-bin
Gdb is complaining that it cannot load shared objects :
warning: Shared object "$SOLIB_PATH/libc.so.4" could not be validated and will be ignored.
The big thing here is to make sure you're using the exact same binary that is on the target (that the program runs over). This is often quite difficult with libc, especially because libc/ldqnx are sometimes "the same thing" and it confuses gdb.
The easiest way to do this is to log your mkifs output (on the linux host):
make 2>&1 | tee build-out.txt
and read through that, search for libc.so.4, and copy the binary that's being pulled onto the target into . (wherever you're running gdb) so you don't need to mess with SOLIB paths (the lazy solution).
Alternatively, scp/ftp a new libc (one that you want to use, and ideally one that you have associated symbols for) into /tmp and use LD_LIBRARY_PATH to pull that one (and DL_DEBUG=libs to confirm, if you need). Use that same libc to debug
source: I work at QNX and even we struggle with gdb + libc sometimes

GDB: Get useful stack trace after uncaught C++ exception [duplicate]

We get core files from running our software on a Customer's box. Unfortunately because we've always compiled with -O2 without debugging symbols this has lead to situations where we could not figure out why it was crashing, we've modified the builds so now they generate -g and -O2 together. We then advice the Customer to run a -g binary so it becomes easier to debug.
I have a few questions:
What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
Are there any good books for debugging on Linux, or Solaris? Something example oriented would be great. I am looking for real-life examples of figuring out why a routine crashed and how the author arrived at a solution. Something more on the intermediate to advanced level would be good, as I have been doing this for a while now. Some assembly would be good as well.
Here's an example of a crash that requires us to tell the Customer to get a -g ver. of the binary:
Program terminated with signal 11, Segmentation fault.
#0 0xffffe410 in __kernel_vsyscall ()
(gdb) where
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x00454ff1 in select () from /lib/libc.so.6
...
<omitted frames>
Ideally I'd like to solve find out why exactly the app crashed - I suspect it's memory corruption but I am not 100% sure.
Remote debugging is strictly not allowed.
Thanks
What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
It the executable is dynamically linked, as yours is, the stack GDB produces will (most likely) not be meaningful.
The reason: GDB knows that your executable crashed by calling something in libc.so.6 at address 0x00454ff1, but it doesn't know what code was at that address. So it looks into your copy of libc.so.6 and discovers that this is in select, so it prints that.
But the chances that 0x00454ff1 is also in select in your customers copy of libc.so.6 are quite small. Most likely the customer had some other procedure at that address, perhaps abort.
You can use disas select, and observe that 0x00454ff1 is either in the middle of instruction, or that the previous instruction is not a CALL. If either of these holds, your stack trace is meaningless.
You can however help yourself: you just need to get a copy of all libraries that are listed in (gdb) info shared from the customer system. Have the customer tar them up with e.g.
cd /
tar cvzf to-you.tar.gz lib/libc.so.6 lib/ld-linux.so.2 ...
Then, on your system:
mkdir /tmp/from-customer
tar xzf to-you.tar.gz -C /tmp/from-customer
gdb /path/to/binary
(gdb) set solib-absolute-prefix /tmp/from-customer
(gdb) core core # Note: very important to set solib-... before loading core
(gdb) where # Get meaningful stack trace!
We then advice the Customer to run a -g binary so it becomes easier to debug.
A much better approach is:
build with -g -O2 -o myexe.dbg
strip -g myexe.dbg -o myexe
distribute myexe to customers
when a customer gets a core, use myexe.dbg to debug it
You'll have full symbolic info (file/line, local variables), without having to ship a special binary to the customer, and without revealing too many details about your sources.
You can indeed get useful information from a crash dump, even one from an optimized compile (although it's what is called, technically, "a major pain in the ass.") a -g compile is indeed better, and yes, you can do so even when the machine on which the dump happened is another distribution. Basically, with one caveat, all the important information is contained in the executable and ends up in the dump.
When you match the core file with the executable, the debugger will be able to tell you where the crash occurred and show you the stack. That in itself should help a lot. You should also find out as much as you can about the situation in which it happens -- can they reproduce it reliably? If so, can you reproduce it?
Now, here's the caveat: the place where the notion of "everything is there" breaks down is with shared object files, .so files. If it is failing because of a problem with those, you won't have the symbol tables you need; you may only be able to see what library .so it happens in.
There are a number of books about debugging, but I can't think of one I'd recommend.
As far as I remember, you dont need to ask your customer to run with the binary built with -g option. What is needed is that you should have a build with -g option. With that you can load the core file and it will show the whole stack trace. I remember few weeks ago, I created core files, with build (-g) and without -g and the size of core was same.
Inspect the values of local variables you see when you walk the stack ? Especially around the select() call. Do this on customer's box, just load the dump and walk the stack...
Also , check the value of FD_SETSIZE on both your DEV and PROD platforms !
Copying the resolution from my question which was considered a duplicate of this.
set solib-absolute-prefix from the accepted solution did not help for me. set sysroot was absolutely necessary to make gdb load locally provided libs.
Here is the list of commands I used to open core dump:
# note: all the .so files obtained from user machine must be put into local directory.
#
# most importantly, the following files are necessary:
# 1. libthread_db.so.1 and libpthread.so.0: required for thread debugging.
# 2. other .so files are required if they occur in call stack.
#
# these files must also be renamed exactly as the symlinks
# i.e. libpthread-2.28.so should be renamed to libpthread.so.0
# load executable file
file ./thedarkmod.x64
# force gdb to forget about local system!
# load all .so files using local directory as root
set sysroot .
# drop dump-recorded paths to .so files
# i.e. load ./libpthread.so.0 instead of ./lib/x86_64-linux-gnu/libpthread.so.0
set solib-search-path .
# disable damn security protection
set auto-load safe-path /
# load core dump file
core core.6487
# print stacktrace
bt

Force gdb to use provided thread lib

I have an embedded ARM application which is bundled with all the so-libraries stripped, including the libpthread.so. Sometimes the application gets stuck in some part of code and I want to be able to attach to it with gdb and see what's going on. The problem is that gdb refuses to load the needed threading support library, with the following messages:
Trying host libthread_db library: /home/me/debug_libs/libthread_db.so.1.
td_ta_new failed: application not linked with libthread
thread_db_load_search returning 0
warning: Unable to find libthread_db matching inferior's thread
library, thread debugging will not be available.
Because of this I cannot debug the application, e.g. I cannot see current call stacks for all threads.
After some investigation I suspect that the td_ta_new failing with the application not linked with libthread is caused by the stripped version of libpthread, which lacks the nptl_version symbol. Is there any way to bypass the error?
The gdb is compiled for ARM and being run on the device itself. I have unstripped versions of the libraries, but the application is already running with the stripped libraries.
Is there any way to bypass the error?
A few ways that come to mind:
Use add-symbol-file to override the stripped libpthread.so.0 with un-stripped one:
(gdb) info shared libpthread.so
# shows the path and memory address where stripped libpthread.so.0 is loaded
(gdb) add-symbol-file /path/to/unstripped/libpthread.so.0 $address
# should override with new symbols, and attempt to re-load libthread_db.so.1
Run gdb -ex 'set sysroot /path/to/unstripped' ... where /path/to/unstripped is the path that mirrors installed tree (that is, if you are using /lib/libpthread.so.0, there should be /path/to/unstripped/lib/libpthread.so.0.
I have not tested this, but I believe it should work.
You could comment out the version check in GDB and rebuild it.

gdb7.7 does not load shared libraries

Perhaps, i misunderstand something, but i can't make Gdb to read debug libraries.
what i do from command line is:
gdb
file problem_exec
b main
r
GDB stops at:
(gdb) r
Starting program: /Users/.../problem_exec
Breakpoint 1, main (argc=<error reading variable: Could not find the frame base for "main(int, char**)".>, argv=<error reading variable: Could not find the frame base for "main(int, char**)".>)
No shared libraries:
(gdb) info shared
No shared libraries loaded at this time.
The last command gives: "No shared libraries loaded at this time."
My .gdbinit looks like:
# file .gdbinit
set stop-on-solib-events 1
# stop gdb from stepping over functions and output diagnostics
set step-mode on
set breakpoint pending on
set env DYLD_LIBRARY_PATH path1:path2:path3
#automatically load shared libraries (on/off):
set auto-solib-add on
I am sure that an executable is linked against debug version of the shared libraries (in total there are some 30 libraries, but the one I am interested in is definitely compiled in a debug mode). I checked it with otool -L problem_exec
If I run program, it goes until a run-time error in the library I want to debug, but i can not step in.
Am I missing something?
p.s. I run self-compiled version of Gdb on os-x.
UPDATE: it could be related to this problem.
set stop-on-solib-events 1
With that setting, GDB should stop before any shared libraries are loaded.
When you do this:
file problem_exec
b main
r
... where is GDB stopped?
info shared
If GDB is stopped at main, then shared libraries should be loaded. But if it is stopped in the dynamic loader (as I expect it is), then they should not be loaded yet.