On Linux, when the executable of a C++ program compiled and linked with gcc is loaded, the following happens:
1. exec* syscall
2. dynamic libraries loaded (by ld.so)
3. C++ static initialization
4. entry point of main
Suppose I have some function with the prototype void f().
Is there some way (via source modification, attributes, compiler/linker options, etc.) to link the executable with f such that it will be executed between steps 1 and 2?
What about between steps 2 and 3?
(Clearly there is no standard way to do this; I am asking for a platform-specific, compiler-specific way for recent versions of gcc/linux/x86_64/glibc/binutils.)
Yes, you could do this between (1) and (2), or between (2) and (3). Step 2, "dynamic libraries loaded", is actually done by calling the dynamic linker, ld.so. Typically, this would be /lib64/ld-linux-x86-64.so.2 or similar; it's part of glibc. However, the path is actually specified in the executable, so you can use any path you want.
$ readelf -l `which bash`
⋮
Program Headers:
⋮
INTERP 0x0000000000000238 0x0000000000400238 0x0000000000400238
0x000000000000001c 0x000000000000001c R 1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
⋮
This is in addition to doing things like LD_PRELOAD/LD_AUDIT.
For between (2) and (3), it sounds like you just want to change the entry point address.
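For the LD_AUDIT route mentioned above, a minimal rtld-audit library might look roughly like the following sketch (the file name audit.c and the messages are my own illustrations, not from the original post). Its la_version() hook runs after the kernel has handed control to ld.so but before the application's shared libraries are mapped, which is about as close to "between (1) and (2)" as you can get without replacing the interpreter itself:

/* audit.c -- minimal LD_AUDIT (rtld-audit) sketch; names are illustrative.
 * Build:  gcc -shared -fPIC -o libaudit.so audit.c
 * Run:    LD_AUDIT=./libaudit.so ./your_program
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <link.h>
#include <stdio.h>

/* Called by ld.so as soon as the audit library itself is loaded,
 * i.e. before the application's dependencies are mapped. */
unsigned int la_version(unsigned int version)
{
    fprintf(stderr, "audit: la_version(%u)\n", version);
    return LAV_CURRENT;
}

/* Called once for every object (executable and libraries) as it is loaded. */
unsigned int la_objopen(struct link_map *map, Lmid_t lmid, uintptr_t *cookie)
{
    fprintf(stderr, "audit: loaded %s\n",
            map->l_name[0] ? map->l_name : "(main executable)");
    return 0; /* no symbol-binding notifications requested */
}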
Basically no. The execve() system call wipes out the old program image: there is no way for anything you do in your address space to survive into the new one. You are allowed to pass file descriptors (except ones flagged CLOEXEC, of course) into the new process, and you can pass it arguments via the environment.
... which gives you something that might do what you want. You can set LD_PRELOAD to load a shared library "first", before any code is run from your target executable and before any symbols are resolved by the dynamic linker. It's not clear from your question whether this meets your requirements or not.
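To make the LD_PRELOAD approach concrete, here is a minimal sketch (the file name preload.c, the function name, and the message are illustrative). A constructor in a preloaded library runs after the dynamic linker has mapped the shared libraries but before the main executable's static initializers and main(), i.e. roughly "between (2) and (3)" from the executable's point of view:

/* preload.c -- minimal LD_PRELOAD sketch; names are illustrative.
 * Build:  gcc -shared -fPIC -o libpreload.so preload.c
 * Run:    LD_PRELOAD=./libpreload.so ./your_program
 */
#include <stdio.h>

/* Runs after ld.so has mapped the shared libraries, but before the
 * main executable's C++ static initializers and main(). */
__attribute__((constructor))
static void f(void)
{
    fprintf(stderr, "f(): running before the executable's initializers\n");
}

Note that the constructors of the executable's other shared library dependencies also run during this phase, in dependency order, so "first" here means before the executable's own initializers, not necessarily before every library's.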
I need to get the list of shared libraries used by an app at runtime. Most of them can be listed by ldd, but some can be seen only with gdb -p <pid> and by running the gdb command info sharedlib.
It would really help if I could learn, in some way, for a chosen library (in the list output by info sharedlib), which library (of the same list) caused it to be loaded. Is there any way to learn it in gdb or in some other way? Because sometimes I see a loaded library in the list and cannot work out why it is there and which (probably previously loaded) library loaded it.
UPDATE:
I attach a screenshot of gdb showing the info that I need. I used a breakpoint on dlopen, as was suggested in the comments and in the answer. The command x/s $rdi prints out the first argument of dlopen, as per the Linux System V ABI, i.e. it prints the name of the library about which I want to learn who loaded it (libdebuginfod.so.1). I put it here just for those who are curious. In my case, it can be seen that libdebuginfod.so.1 was loaded by libdw.so.1 (as shown by the bt command).
Is there any way to learn it in gdb or in some other way?
There are a few ways.
You can run the program with env LD_DEBUG=files /path/to/exe.
This will produce output similar to:
LD_DEBUG=files /bin/date
76042:
76042: file=libc.so.6 [0]; needed by /bin/date [0]
It is the needed by part that you most care about.
You could also use GDB and set stop-on-solib-events 1. This will produce output similar to:
Stopped due to shared library event:
Inferior loaded /lib/x86_64-linux-gnu/libc.so.6
At that point, you can execute the where command and observe which dlopen() call caused the new library to be loaded.
You could also set a breakpoint on dlopen() and do the same.
stop-on-solib-events may be better if your executable repeatedly dlopen()s the same library: a breakpoint on dlopen() stops on every call, even though the library set does not change when you dlopen() the same library again, so you end up stopping when you don't care to. Setting stop-on-solib-events avoids such unnecessary stops.
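If you want a small test bed for these techniques, the following sketch may help (all file and symbol names are made up for illustration): a main program dlopen()s one library whose constructor in turn dlopen()s another, so the second library shows up in the loaded list even though main() never asked for it.

/* chain.c -- illustrative test bed for "which library loaded this one?".
 * Build (assuming gcc/glibc on Linux):
 *   gcc -shared -fPIC -DBUILD_INNER -o libinner.so chain.c
 *   gcc -shared -fPIC -DBUILD_OUTER -o libouter.so chain.c -ldl
 *   gcc -o chain chain.c -ldl
 * Observe:
 *   LD_DEBUG=files ./chain        # "needed by" / "dynamically loaded by" lines
 *   gdb ./chain                   # break dlopen; run; bt
 */
#include <dlfcn.h>
#include <stdio.h>

#if defined(BUILD_INNER)
void inner_hello(void) { puts("inner"); }

#elif defined(BUILD_OUTER)
/* Runs when libouter.so is loaded and drags in libinner.so. */
__attribute__((constructor))
static void outer_init(void)
{
    if (!dlopen("./libinner.so", RTLD_NOW))
        fprintf(stderr, "dlopen libinner.so failed: %s\n", dlerror());
}

#else /* the main program */
int main(void)
{
    if (!dlopen("./libouter.so", RTLD_NOW))
        fprintf(stderr, "dlopen libouter.so failed: %s\n", dlerror());
    return 0;
}
#endif

Breaking on dlopen() and typing bt should then show libouter.so's constructor in the backtrace while libinner.so is being loaded, which is exactly the "who loaded it" information.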
We get core files from running our software on a Customer's box. Unfortunately, because we've always compiled with -O2 and without debugging symbols, this has led to situations where we could not figure out why it was crashing, so we've modified the builds so that they now use -g and -O2 together. We then advise the Customer to run a -g binary so that it becomes easier to debug.
I have a few questions:
What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
Are there any good books for debugging on Linux, or Solaris? Something example oriented would be great. I am looking for real-life examples of figuring out why a routine crashed and how the author arrived at a solution. Something more on the intermediate to advanced level would be good, as I have been doing this for a while now. Some assembly would be good as well.
Here's an example of a crash that requires us to tell the Customer to get a -g version of the binary:
Program terminated with signal 11, Segmentation fault.
#0 0xffffe410 in __kernel_vsyscall ()
(gdb) where
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x00454ff1 in select () from /lib/libc.so.6
...
<omitted frames>
Ideally I'd like to find out exactly why the app crashed - I suspect it's memory corruption but I am not 100% sure.
Remote debugging is strictly not allowed.
Thanks
What happens when a core file is generated from a Linux distro other than the one we are running in Dev? Is the stack trace even meaningful?
If the executable is dynamically linked, as yours is, the stack trace GDB produces will (most likely) not be meaningful.
The reason: GDB knows that your executable crashed by calling something in libc.so.6 at address 0x00454ff1, but it doesn't know what code was at that address. So it looks into your copy of libc.so.6, discovers that this address is in select, and prints that.
But the chances that 0x00454ff1 is also in select in your customer's copy of libc.so.6 are quite small. Most likely the customer had some other procedure at that address, perhaps abort.
You can use disas select and check whether 0x00454ff1 is in the middle of an instruction, or whether the previous instruction is not a CALL. If either of these holds, your stack trace is meaningless.
You can however help yourself: you just need to get a copy of all libraries that are listed in (gdb) info shared from the customer system. Have the customer tar them up with e.g.
cd /
tar cvzf to-you.tar.gz lib/libc.so.6 lib/ld-linux.so.2 ...
Then, on your system:
mkdir /tmp/from-customer
tar xzf to-you.tar.gz -C /tmp/from-customer
gdb /path/to/binary
(gdb) set solib-absolute-prefix /tmp/from-customer
(gdb) core core # Note: very important to set solib-... before loading core
(gdb) where # Get meaningful stack trace!
We then advise the Customer to run a -g binary so it becomes easier to debug.
A much better approach is:
build with -g -O2 -o myexe.dbg
strip -g myexe.dbg -o myexe
distribute myexe to customers
when a customer gets a core, use myexe.dbg to debug it
You'll have full symbolic info (file/line, local variables), without having to ship a special binary to the customer, and without revealing too many details about your sources.
You can indeed get useful information from a crash dump, even one from an optimized compile (although it is, technically speaking, "a major pain in the ass"). A -g compile is indeed better, and yes, you can do this even when the machine on which the dump happened runs another distribution. Basically, with one caveat, all the important information is contained in the executable and ends up in the dump.
When you match the core file with the executable, the debugger will be able to tell you where the crash occurred and show you the stack. That in itself should help a lot. You should also find out as much as you can about the situation in which it happens -- can they reproduce it reliably? If so, can you reproduce it?
Now, here's the caveat: the place where the notion of "everything is there" breaks down is with shared object files, .so files. If it is failing because of a problem with those, you won't have the symbol tables you need; you may only be able to see which .so library it happens in.
There are a number of books about debugging, but I can't think of one I'd recommend.
As far as I remember, you don't need to ask your customer to run a binary built with the -g option. What is needed is that you have a build with the -g option. With that, you can load the core file and it will show the whole stack trace. A few weeks ago I created core files from a build with -g and one without, and the size of the core was the same.
Inspect the values of the local variables you see when you walk the stack, especially around the select() call. Do this on the customer's box; just load the dump and walk the stack...
Also, check the value of FD_SETSIZE on both your DEV and PROD platforms!
Copying the resolution from my question which was considered a duplicate of this.
set solib-absolute-prefix from the accepted solution did not help for me. set sysroot was absolutely necessary to make gdb load locally provided libs.
Here is the list of commands I used to open the core dump:
# note: all the .so files obtained from user machine must be put into local directory.
#
# most importantly, the following files are necessary:
# 1. libthread_db.so.1 and libpthread.so.0: required for thread debugging.
# 2. other .so files are required if they occur in call stack.
#
# these files must also be renamed to match the symlink names,
# e.g. libpthread-2.28.so should be renamed to libpthread.so.0
# load executable file
file ./thedarkmod.x64
# force gdb to forget about local system!
# load all .so files using local directory as root
set sysroot .
# drop dump-recorded paths to .so files
# i.e. load ./libpthread.so.0 instead of ./lib/x86_64-linux-gnu/libpthread.so.0
set solib-search-path .
# disable gdb's auto-load safe-path security check
set auto-load safe-path /
# load core dump file
core core.6487
# print stacktrace
bt
Main question:
On Ubuntu, trying to debug an embedded application running on QNX, I am getting the following error message from gdb:
warning: Shared object "$SOLIB_PATH/libc.so.4" could not be validated and will be ignored.
Q: What is the "validation" operation that is going on?
After some research I found that the information reported by readelf -n libfoo.so contains a build-id, and that this is compared against something; a mismatch can cause gdb to refuse to load the library. If that's the case, which ELF file's build-id is the shared object's build-id compared against? Can I find this information by parsing the executable file?
More context:
I have a .core file for this executable. I am using a version of gdb provided by QNX and making sure I use set sysroot and set solib-search-path to where I installed the QNX toolchain.
My full command to launch gdb in Ubuntu is:
$QNX_TOOLCHAIN_PATH/ntox86_64-gdb --init-eval-command 'set sysroot $SYSROOT_PATH' --init-eval-command 'set solib-search-path $SOLIB_PATH' --init-eval-command 'python sys.path.append("/usr/share/gcc-8/python");' -c path-to-exe.core path-to-executable-bin
GDB is complaining that it cannot load shared objects:
warning: Shared object "$SOLIB_PATH/libc.so.4" could not be validated and will be ignored.
The big thing here is to make sure you're using the exact same binary that is on the target (the one the program actually runs with). This is often quite difficult with libc, especially because libc/ldqnx are sometimes "the same thing", which confuses gdb.
The easiest way to do this is to log your mkifs output (on the linux host):
make 2>&1 | tee build-out.txt
and read through that, search for libc.so.4, and copy the binary that's being pulled onto the target into . (wherever you're running gdb) so you don't need to mess with SOLIB paths (the lazy solution).
Alternatively, scp/ftp a new libc (one that you want to use, and ideally one that you have matching symbols for) into /tmp and use LD_LIBRARY_PATH to pull that one in (and DL_DEBUG=libs to confirm, if you need to). Use that same libc to debug.
source: I work at QNX and even we struggle with gdb + libc sometimes
I am working on Ubuntu 14.04 LTS.
I have an executable file exec compiled from file.c. The file.c makes use of functions from a static library. For example, let's say that fubar() is a function of the static library that is used in file.c.
This is something that I have noticed.
nm exec | grep fubar gives a certain value.
(on my system and for my executable, 0808377f)
gdb ./exec and then break fubar gives a different value.
(on my system and for my executable, 0x8083785)
When I do a similar thing for another executable file (exec1, compiled from file1.c), it outputs the same value for both commands.
Both commands are supposed to output the same virtual address, aren't they? I am obviously missing something. Can someone explain what exactly is happening, and what the difference between the two commands is?
Barring unusual things like -fPIE, what is going on here is that the gdb command break function actually means "break after the function prologue for function". This way, arguments are set up properly by the time the breakpoint is hit.
If you want to break exactly at the first instruction of a function, use the * syntax, like:
(gdb) break *function
If you do this the addresses will probably match.
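A small self-contained way to see this (file and function names are purely illustrative; the exact addresses will of course differ on your system):

/* prologue.c -- illustrative example of symbol address vs. post-prologue address.
 * Build (-no-pie keeps the addresses comparable on toolchains that default
 * to PIE; omit it if your gcc does not accept the flag):
 *   gcc -g -no-pie -o prologue prologue.c
 * Compare:
 *   nm ./prologue | grep fubar    # address of the first instruction
 *   gdb ./prologue
 *   (gdb) break fubar             # stops a few bytes later, after the prologue
 *   (gdb) break *fubar            # stops exactly at the nm address
 */
#include <stdio.h>

int fubar(int a, int b)
{
    /* The prologue sets up the stack frame and spills the arguments;
     * "break fubar" skips past it so that a and b are already usable. */
    int sum = a + b;
    printf("sum = %d\n", sum);
    return sum;
}

int main(void)
{
    return fubar(2, 3) == 5 ? 0 : 1;
}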
I'm using Red Hat Linux 3. Can someone explain how it is possible that I am able to analyze, with gdb, a core dump generated on Red Hat Linux 5?
Not that I'm complaining :) but I need to be sure this will always work...?
EDIT: the shared libraries are the same version, so no worries about that; they are placed on shared storage so they can be accessed from both Red Hat 5 and Red Hat 3.
Thanks.
You can try the following GDB commands to open a core file:
gdb
(gdb) exec-file <path to executable>
(gdb) set solib-absolute-prefix <path to shared library>
(gdb) core-file <path to core file>
The reason why you can't rely on it is that every process uses libc or the system shared libraries, which will definitely have changed from Red Hat 3 to Red Hat 5. So the instruction addresses and the number of instructions in native functions will differ, and that is where the debugger gets goofed up and can possibly show you wrong data to analyze. So it is always better to analyze the core on the same platform, or, if you can, copy all the required shared libraries to the other machine and set the path through set solib-absolute-prefix.
In my experience, analysing a core file generated on another system does not work, because the standard library (and the other libraries your program probably uses) will typically be different, so the addresses of the functions are different and you cannot even get a sensible backtrace.
Don't do it; even if it works sometimes, you cannot rely on it.
You can always run gdb -c /path/to/corefile /path/to/program_that_crashed. However, if program_that_crashed has no debug info (i.e. was not compiled and linked with the -g gcc/ld flag), the coredump is not that useful unless you're a hard-core debugging expert ;-)
Note that the generation of corefiles can be disabled (and it's very likely that it is disabled by default on most distros). See man ulimit. Call ulimit -c to see the size limit for core files; "0" means disabled. Try ulimit -c unlimited in this case. If a size limit is imposed, the coredump will not exceed it, which may cut off valuable information.
Also, the path where a coredump is generated depends on /proc/sys/kernel/core_pattern. Use cat /proc/sys/kernel/core_pattern to query the current pattern. It's actually a path, and if it doesn't start with / then the file will be generated in the current working directory of the process. And if cat /proc/sys/kernel/core_uses_pid returns "1", then the coredump will have the PID of the crashed process as its file extension. You can also set both values; e.g. echo -n /tmp/core > /proc/sys/kernel/core_pattern will force all coredumps to be generated in /tmp.
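If you want to verify these settings end to end, a deliberately crashing test program is enough (the file name is just an example):

/* crashme.c -- deliberately crashes, to check that a core file is produced.
 * Build:  gcc -g -o crashme crashme.c
 * Run:    ulimit -c unlimited   (in the same shell)
 *         ./crashme             (should report a segmentation fault, core dumped)
 *         gdb ./crashme core    (or the path configured in core_pattern)
 */
int main(void)
{
    volatile int *p = 0;   /* null pointer */
    *p = 42;               /* SIGSEGV, which should trigger the core dump */
    return 0;
}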
I understand the question as: how is it possible that I am able to analyse a core that was produced under one version of an OS under another version of that OS?
Just because you are lucky (even that is questionable). There are a lot of things that can go wrong by trying to do so:
the tool chains (gcc, gdb, etc.) will be of different versions
the shared libraries will be of different versions
so no, you shouldn't rely on that.
You have asked a similar question and accepted an answer (by yourself, of course) here: Analyzing core file of shared object
Once you load the core file, you can get the stack trace, find the last function call, and check the code for the reason for the crash.
There is a small tutorial here to get started with.
EDIT:
I am assuming you want to know how to analyse a core file using gdb on Linux, as your question is a little unclear.