Force gdb to use provided thread lib - gdb

I have an embedded ARM application which is bundled with all the so-libraries stripped, including the libpthread.so. Sometimes the application gets stuck in some part of code and I want to be able to attach to it with gdb and see what's going on. The problem is that gdb refuses to load the needed threading support library, with the following messages:
Trying host libthread_db library: /home/me/debug_libs/libthread_db.so.1.
td_ta_new failed: application not linked with libthread
thread_db_load_search returning 0
warning: Unable to find libthread_db matching inferior's thread
library, thread debugging will not be available.
Because of this I cannot debug the application, e.g. I cannot see current call stacks for all threads.
After some investigation I suspect that the td_ta_new failing with the application not linked with libthread is caused by the stripped version of libpthread, which lacks the nptl_version symbol. Is there any way to bypass the error?
The gdb is compiled for ARM and being run on the device itself. I have unstripped versions of the libraries, but the application is already running with the stripped libraries.

Is there any way to bypass the error?
A few ways that come to mind:
Use add-symbol-file to override the stripped libpthread.so.0 with un-stripped one:
(gdb) info shared libpthread.so
# shows the path and memory address where stripped libpthread.so.0 is loaded
(gdb) add-symbol-file /path/to/unstripped/libpthread.so.0 $address
# should override with new symbols, and attempt to re-load libthread_db.so.1
Run gdb -ex 'set sysroot /path/to/unstripped' ... where /path/to/unstripped is a path that mirrors the installed tree (that is, if you are using /lib/libpthread.so.0, there should be a /path/to/unstripped/lib/libpthread.so.0).
I have not tested this, but I believe it should work.
You could comment out the version check in GDB and rebuild it.
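A minimal sketch of the add-symbol-file route, assuming you have saved the text printed by info shared: it pulls the load address of the stripped library out of that text and builds the command to paste into gdb. The function name and the sample row in the comment are hypothetical, and the sketch assumes the first column of info shared is the address add-symbol-file expects.

```python
import os

def add_symbol_file_command(info_shared_output, soname, unstripped_path):
    """Build an add-symbol-file command from `info shared` text, or return None."""
    for line in info_shared_output.splitlines():
        fields = line.split()
        # Illustrative row: 0x40021a40  0x4002f3a8  Yes (*)  /lib/libpthread.so.0
        if len(fields) < 4 or not fields[0].startswith("0x"):
            continue  # header, footnote, or a row without a load address
        if os.path.basename(fields[-1]) == soname:
            # Assumption: the "From" column is the address add-symbol-file wants.
            return "add-symbol-file {} {}".format(unstripped_path, fields[0])
    return None
```

If the library is not in the list, the function returns None rather than guessing an address.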

gdb: how to learn which shared library loaded a shared library in question

I need to get the list of shared libraries used by an app at runtime. Most of them can be listed by ldd, but some can be seen only with gdb -p <pid> and by running the gdb command info sharedlib.
It would really help if I could learn, for a chosen library in the list printed by info sharedlib, which other library in that list caused it to be loaded. Is there any way to learn it in gdb or in some other way? Sometimes I see a loaded library in the list and cannot tell why it is there or which (probably previously loaded) library loaded it.
UPDATE:
I attach a screenshot of gdb showing the info that I need. I used a breakpoint on dlopen, as suggested in the comments and in the answer. The command x/s $rdi prints the first argument of dlopen (per the x86-64 System V ABI), i.e. the name of the library whose loader I want to identify (libdebuginfod.so.1). I put it here just for those who are curious. In my case, it can be seen that libdebuginfod.so.1 was loaded by libdw.so.1 (as shown by the bt command).
Is there any way to learn it in gdb or in some other way?
There are a few ways.
You can run the program with env LD_DEBUG=files /path/to/exe.
This will produce output similar to:
LD_DEBUG=files /bin/date
76042:
76042: file=libc.so.6 [0]; needed by /bin/date [0]
It is the needed by part that you most care about.
You could also use GDB with set stop-on-solib-events 1. This will produce output similar to:
Stopped due to shared library event:
Inferior loaded /lib/x86_64-linux-gnu/libc.so.6
At that point, you could execute the where command and observe which dlopen() call caused the new library to be loaded.
You could also set a breakpoint on dlopen() and do the same.
Using stop-on-solib-events may be better if your executable repeatedly dlopen()s the same library: the set of loaded libraries does not change when the same library is dlopen()ed again, so a breakpoint on dlopen() would stop at calls you don't care about. Setting stop-on-solib-events avoids such unnecessary stops.
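For the LD_DEBUG=files route, a small parser can turn the log into a "who loaded whom" table. This is a sketch only: the function names are hypothetical, the "needed by" wording comes from the sample output above, and the "dynamically loaded by" wording for dlopen() is what I recall glibc printing, so treat it as an assumption.

```python
import os
import re

# Matches lines such as:
#   76042: file=libc.so.6 [0]; needed by /bin/date [0]
# "dynamically loaded by" is assumed to be the wording used for dlopen().
_LINE = re.compile(
    r"file=(?P<lib>\S+)\s+\[\d+\];\s+"
    r"(?P<how>needed by|dynamically loaded by)\s+(?P<parent>\S+)\s+\[\d+\]"
)

def parse_ld_debug(text):
    """Return {library basename: (how, parent path)} for each first load."""
    loaders = {}
    for line in text.splitlines():
        m = _LINE.search(line)
        if m:
            key = os.path.basename(m.group("lib"))
            loaders.setdefault(key, (m.group("how"), m.group("parent")))
    return loaders

def load_chain(loaders, lib):
    """Follow parents from lib back towards the executable."""
    chain = [lib]
    seen = {lib}
    while lib in loaders:
        parent = loaders[lib][1]
        key = os.path.basename(parent)
        if key in seen:
            break  # guard against cycles in the log
        chain.append(parent)
        seen.add(key)
        lib = key
    return chain
```

Libraries are keyed by basename because the log names the loaded file by soname but names the parent by full path.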

Analyze Linux core dump on different machine: threads and shared libs [duplicate]

We get core files from running our software on a Customer's box. Unfortunately, because we've always compiled with -O2 without debugging symbols, this has led to situations where we could not figure out why it was crashing. We've modified the builds so they now use -g and -O2 together. We then advise the Customer to run a -g binary so it becomes easier to debug.
I have a few questions:
What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
Are there any good books for debugging on Linux, or Solaris? Something example oriented would be great. I am looking for real-life examples of figuring out why a routine crashed and how the author arrived at a solution. Something more on the intermediate to advanced level would be good, as I have been doing this for a while now. Some assembly would be good as well.
Here's an example of a crash that requires us to tell the Customer to get a -g ver. of the binary:
Program terminated with signal 11, Segmentation fault.
#0 0xffffe410 in __kernel_vsyscall ()
(gdb) where
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x00454ff1 in select () from /lib/libc.so.6
...
<omitted frames>
Ideally I'd like to find out why exactly the app crashed. I suspect it's memory corruption but I am not 100% sure.
Remote debugging is strictly not allowed.
Thanks
What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
If the executable is dynamically linked, as yours is, the stack GDB produces will (most likely) not be meaningful.
The reason: GDB knows that your executable crashed by calling something in libc.so.6 at address 0x00454ff1, but it doesn't know what code was at that address. So it looks into your copy of libc.so.6 and discovers that this is in select, so it prints that.
But the chances that 0x00454ff1 is also in select in your customer's copy of libc.so.6 are quite small. Most likely the customer had some other procedure at that address, perhaps abort.
You can use disas select and check whether 0x00454ff1 falls in the middle of an instruction, or whether the instruction preceding it is not a CALL. If either of these holds, your stack trace is meaningless.
You can however help yourself: you just need to get a copy of all libraries that are listed in (gdb) info shared from the customer system. Have the customer tar them up with e.g.
cd /
tar cvzf to-you.tar.gz lib/libc.so.6 lib/ld-linux.so.2 ...
Then, on your system:
mkdir /tmp/from-customer
tar xzf to-you.tar.gz -C /tmp/from-customer
gdb /path/to/binary
(gdb) set solib-absolute-prefix /tmp/from-customer
(gdb) core core # Note: very important to set solib-... before loading core
(gdb) where # Get meaningful stack trace!
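To avoid typing the library list by hand, the tar command can be derived from saved info shared output. A hedged sketch: both function names are hypothetical, paths containing spaces are not handled, and the -h (dereference) flag is my addition, on the assumption that entries such as /lib/libc.so.6 are symlinks on the customer system.

```python
def shared_library_paths(info_shared_output):
    """Collect absolute library paths from `info shared` text, in order."""
    paths = []
    for line in info_shared_output.splitlines():
        fields = line.split()
        # The library path is the last column; skip headers and footnotes.
        if fields and fields[-1].startswith("/") and fields[-1] not in paths:
            paths.append(fields[-1])
    return paths

def tar_command(paths, archive="to-you.tar.gz"):
    """Build an argv list equivalent to: cd / && tar cvzf to-you.tar.gz lib/..."""
    # -h follows symlinks so the real library file is archived (my addition).
    base = ["tar", "-c", "-z", "-v", "-h", "-f", archive, "-C", "/"]
    return base + [p.lstrip("/") for p in paths]
```

The resulting list can be passed to subprocess.run on the customer machine, or joined with spaces and sent to the customer as a one-line command.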
We then advise the Customer to run a -g binary so it becomes easier to debug.
A much better approach is:
build with -g -O2 -o myexe.dbg
strip -g myexe.dbg -o myexe
distribute myexe to customers
when a customer gets a core, use myexe.dbg to debug it
You'll have full symbolic info (file/line, local variables), without having to ship a special binary to the customer, and without revealing too many details about your sources.
You can indeed get useful information from a crash dump, even one from an optimized compile (although it's what is called, technically, "a major pain in the ass"). A -g compile is indeed better, and yes, you can do so even when the machine on which the dump happened runs another distribution. Basically, with one caveat, all the important information is contained in the executable and ends up in the dump.
When you match the core file with the executable, the debugger will be able to tell you where the crash occurred and show you the stack. That in itself should help a lot. You should also find out as much as you can about the situation in which it happens -- can they reproduce it reliably? If so, can you reproduce it?
Now, here's the caveat: the place where the notion of "everything is there" breaks down is with shared object files, .so files. If it is failing because of a problem with those, you won't have the symbol tables you need; you may only be able to see what library .so it happens in.
There are a number of books about debugging, but I can't think of one I'd recommend.
As far as I remember, you don't need to ask your customer to run a binary built with the -g option. What is needed is that you have a build with -g. With that you can load the core file and it will show the whole stack trace. A few weeks ago I created core files with a -g build and with a non -g build, and the size of the core was the same.
Inspect the values of the local variables you see when you walk the stack, especially around the select() call. Do this on the customer's box: just load the dump and walk the stack.
Also, check the value of FD_SETSIZE on both your DEV and PROD platforms!
Copying the resolution from my question, which was considered a duplicate of this one.
set solib-absolute-prefix from the accepted solution did not help me. set sysroot was absolutely necessary to make gdb load the locally provided libs.
Here is the list of commands I used to open core dump:
# note: all the .so files obtained from user machine must be put into local directory.
#
# most importantly, the following files are necessary:
# 1. libthread_db.so.1 and libpthread.so.0: required for thread debugging.
# 2. other .so files are required if they occur in call stack.
#
# these files must also be renamed exactly as the symlinks
# i.e. libpthread-2.28.so should be renamed to libpthread.so.0
# load executable file
file ./thedarkmod.x64
# force gdb to forget about local system!
# load all .so files using local directory as root
set sysroot .
# drop dump-recorded paths to .so files
# i.e. load ./libpthread.so.0 instead of ./lib/x86_64-linux-gnu/libpthread.so.0
set solib-search-path .
# disable damn security protection
set auto-load safe-path /
# load core dump file
core core.6487
# print stacktrace
bt

Why is gdb refusing to load my shared objects and what is the validation operation

Main question:
In Ubuntu trying to debug an embedded application running in QNX, I am getting the following error message from gdb:
warning: Shared object "$SOLIB_PATH/libc.so.4" could not be validated and will be ignored.
Q: What is the "validation" operation going on ?
After some research I found that the information reported by readelf -n libfoo.so contains a build-id and that this is compared against something and there could be a mismatch causing gdb to refuse to load the library. If that's the case what ELF file's build-id is the shared object's build-id compared against ? Can I find this information parsing the executable file ?
More context:
I have a .core file for this executable. I am using a version of gdb provided by QNX and making sure I use set sysroot and set solib-search-path to where I installed the QNX toolchain.
My full command to launch gdb in Ubuntu is :
$QNX_TOOLCHAIN_PATH/ntox86_64-gdb --init-eval-command 'set sysroot $SYSROOT_PATH' --init-eval-command 'set solib-search-path $SOLIB_PATH' --init-eval-command 'python sys.path.append("/usr/share/gcc-8/python");' -c path-to-exe.core path-to-executable-bin
Gdb is complaining that it cannot load shared objects :
warning: Shared object "$SOLIB_PATH/libc.so.4" could not be validated and will be ignored.
The big thing here is to make sure you're using the exact same binary that is on the target (the one the program actually runs against). This is often quite difficult with libc, especially because libc/ldqnx are sometimes "the same thing" and it confuses gdb.
The easiest way to do this is to log your mkifs output (on the linux host):
make 2>&1 | tee build-out.txt
Then read through that, search for libc.so.4, and copy the binary that's being pulled onto the target into . (wherever you're running gdb) so you don't need to mess with SOLIB paths (the lazy solution).
Alternatively, scp/ftp a new libc (one that you want to use, and ideally one that you have associated symbols for) into /tmp and use LD_LIBRARY_PATH to pull that one in (and DL_DEBUG=libs to confirm, if you need). Use that same libc to debug.
source: I work at QNX and even we struggle with gdb + libc sometimes
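One way to confirm that the host copy really is "the exact same binary" as the one on the target is to compare checksums of the two files. This is a sketch under the assumption that you can copy the target's libc.so.4 back to the host; the function names are hypothetical, and a byte-for-byte match is stricter than whatever check gdb itself performs, which this answer does not spell out.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_binary(host_copy, target_copy):
    """True when both files have identical contents."""
    return sha256_of(host_copy) == sha256_of(target_copy)
```

If the digests differ, the host copy is not the file the program was running against, and a validation warning from gdb would not be surprising.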

Need GLIBC debug information from rpmbuild of updated source

I'm working on RHEL WS 4.5.
I've obtained the glibc source rpm matching this system, opened it to get its contents using rpm2cpio.
Working in that tree, I've created a patch to mtrace.c (I want to add more stack backtrace levels), incorporated it in the spec file, and created a new set of RPMs including the debuginfo RPMs.
I installed all of these on a test vm (created from the same RH base image) and can confirm that my changes are included.
But with more complex executions, I crash in mtrace.c ... but gdb can't find the debug information so I don't get line number info and I can't actually debug the failure.
Based on dates, I think I can confirm that the debug information is installed on the test system in /usr/src/debug/glibc-2.3.6/
I tried
sharedlibrary libc*
in gdb and it tells me the symbols are already loaded.
My test includes a locally built python and full symbols are found for python.
My sense is that perhaps glibc isn't being built under rpmbuild with debug enabled. I've reviewed the glibc.spec file and even built with
_enable_debug_packages
defined as 1 which looked like it might influence the result. My review of the configure scripts invoked during the rpmbuild build step didn't give me any hints.
Hmmmm .. just found /usr/lib/debug/lib/libc-2.3.4.so.debug
and /usr/lib/debug/lib/tls/i486/libc-2.3.4.so.debug
but both of these are reported as stripped by the file command.
It appears that you are installing non-matching RPMs:
/usr/src/debug/glibc-2.3.6
just found /usr/lib/debug/lib/libc-2.3.4.so.debug
These are not for the same version; there is no way they came from the same -debuginfo RPM.
both of these are reported as stripped by the file command.
These should not show as stripped. Either they were not built correctly, or your strip is busted.
Also note that you don't actually have to get all of this working to debug your problem. In the RPMBUILD directory, you should be able to find the glibc build directory, with full-debug libc.so.6. Just copy that library into your VM, and you wouldn't have to worry about the debuginfo RPM.
Try verifying that debug info for mtrace.c is indeed present. First see if the separate debug info for GLIBC knows about a compilation unit called mtrace.c:
$ eu-readelf -w /usr/lib/debug/lib64/libc-2.15.so.debug > t
$ grep mtrace t
name (strp) "mtrace.c"
name (strp) "mtrace"
1 0 0 0 mtrace.c
[10480] "mtrace.c"
[104bb] "mtrace"
[5052] symbol: mtrace, CUs: 446
Then see if GDB actually finds the source file from the glibc-debuginfo RPM:
(gdb) set pagination off
(gdb) start # pause your test program at the beginning of main()
(gdb) set logging on
Copying output to gdb.txt.
(gdb) info sources
Quit GDB then grep for mtrace in gdb.txt and you should find something like /usr/src/debug/glibc-2.15-a316c1f/malloc/mtrace.c
This works with GDB 7.4. I'm not sure the GDB version shipped with RHEL 4.5 supports all the commands used above. Building upstream GDB from source is in fact easier than building Python, though.
When trying to add stack traces to mtrace, make sure you don't call malloc() directly or indirectly in the GLIBC malloc hooks.

gdb: Cannot find new threads: generic error after system update

I am running an OpenEmbedded based Linux on an ARM board, where my application is running. I used to run kernel 2.6.35, gdb 6.8 and gcc 4.3. Lately I've updated the system to kernel 2.6.37, gdb 7.4 (also tried 7.3) and gcc 4.6.
Now my application cannot be debugged anymore (on the ARM board): every time I try to run it in gdb I get the error "gdb: Cannot find new threads: generic error". The application makes use of pthreads and does link against pthreads (readelf lists libpthread.so.0 as a dependency). The suggested solutions I found so far all recommend linking to pthread, which I am already doing. The other recommendation I found was to use LD_PRELOAD=/lib/libpthread.so.0, which does not make any difference for me.
Debugging the x86 builds of the application works without a problem.
EDIT: To answer the questions posed in the first answer, I am using gdb on the target (ARM), i.e. no cross-gdb. I also have not stripped libpthread.so.0 (/lib/libpthread-2.9.so: ELF 32-bit LSB shared object, ARM, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.16, not stripped). glibc remained at version 2.9, and the update involved recompiling the whole linux image
EDIT2: Removing /lib/libthread_db* allows debugging (with consequent warnings, and obviously some features will not work)
EDIT3: Using set debug libthread-db 1 I get:
Starting program: /home/root/app
Trying host libthread_db library: libthread_db.so.1.
Host libthread_db.so.1 resolved to: /lib/libthread_db.so.1.
td_ta_new failed: application not linked with libthread
thread_db_load_search returning 0
Trying host libthread_db library: libthread_db.so.1.
Host libthread_db.so.1 resolved to: /lib/libthread_db.so.1.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
warning: Unable to set global thread event mask: generic error
Warning: find_new_threads_once: find_new_threads_callback: cannot get thread info: generic error
Found 0 new threads in iteration 0.
Warning: find_new_threads_once: find_new_threads_callback: cannot get thread info: generic error
Found 0 new threads in iteration 1.
Warning: find_new_threads_once: find_new_threads_callback: cannot get thread info: generic error
Found 0 new threads in iteration 2.
Warning: find_new_threads_once: find_new_threads_callback: cannot get thread info: generic error
Found 0 new threads in iteration 3.
thread_db_load_search returning 1
Warning: find_new_threads_once: find_new_threads_callback: cannot get thread info: generic error
Found 0 new threads in iteration 0.
Cannot find new threads: generic error
(gdb) Write failed: Broken pipe
There are two common causes of this error:
You have a mis-match between libpthread.so.0 and libthread_db.so.1
You have stripped libpthread.so.0
Your message isn't entirely clear:
are you using cross GDB to debug the application running on ARM from an x86 host?
have you updated (or rebuilt) glibc in addition to updating the kernel, etc.?
If you stripped libpthread.so.0, then don't do that -- libthread_db needs it to not be stripped.
If you are cross-debugging, make sure to rebuild libthread_db.so.1 on host to match glibc on target.
Update:
not cross-debugging
did not strip libpthread
So, something in your GDB or glibc appears to have been broken. You can try to see what that is by
Putting removed libthread_db back, and
(gdb) set debug libthread-db 1
(gdb) run
Update 2:
warning: Unable to set global thread event mask: generic error
This means that GDB was able to look up the td_ta_set_event function in libthread_db and called it, but the function returned an error. One way this could happen is if libthread_db was unable to find the __nptl_threads_events variable in libpthread.so.0. What does this command produce:
nm /lib/libpthread.so.0 | grep __nptl_threads_events
If that command produces output, e.g.:
000000000021c294 b __nptl_threads_events
then I am not sure what else is failing. You'll likely have to debug GDB itself to figure out what's happening.
If on the other hand the grep above produces no output, then it's a problem with your toolchain: you'll have to figure out why that variable doesn't appear in your rebuilt libpthread.so.0.
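The nm check can be scripted when several library builds need to be compared. A minimal sketch, assuming you feed it the text printed by nm /lib/libpthread.so.0: the function names are hypothetical, and the default symbol list simply combines __nptl_threads_events from this answer with nptl_version, the symbol named in the first question above.

```python
def symbols_in_nm_output(nm_output):
    """Map symbol name -> type letter from plain `nm` text output."""
    symbols = {}
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) == 3:      # defined:   <address> <type> <name>
            _, kind, name = fields
        elif len(fields) == 2:    # undefined: <type> <name>
            kind, name = fields
        else:
            continue
        symbols[name] = kind
    return symbols

def missing_thread_symbols(nm_output,
                           wanted=("nptl_version", "__nptl_threads_events")):
    """Return the wanted symbols that do not appear in the nm output."""
    found = symbols_in_nm_output(nm_output)
    return [name for name in wanted if name not in found]
```

An empty result means both symbols are present; a stripped library typically yields no symbol lines at all, so every wanted name is reported as missing.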