Ok, so I'm debugging on x86 with gdb.
The particular files in question are stripped so I have no symbols from the binary itself. I have no access to the source code, but a rough idea of what's happening under the hood.
My asm knowledge is just about good enough to work out the purpose of a function. Thus I can decide on appropriate names for functions after looking at them for a while, but I would like to be able to inject these names as symbols so that, once decided upon, they can be used in later debugging.
Does anybody know how to load custom symbols into gdb?
I've considered recompiling gdb and adding an extra command to the UI to allow loading a symbol at an address. I was wondering if it would be possible to create a dummy object file with the symbols I've defined and then load it using add-symbol-file?
Or would it be possible to compile a C program with dummy functions, somehow force them to be the correct size and at the correct location, and then simply load that?
This sounds like it should be an easy task, but it turns out to be surprisingly annoying, mostly because ELF as a file format is annoying to generate, so most tools are content with parsing it.
As described here, GDB reads the symbol information from two places, first some minimal information from the symbols in the .symtab and/or .dynsym sections, and afterwards more detailed information from the .debug_info section if it is present.
This immediately suggests two possible ways to add the information, either add the symbol to .symtab or generate your own DWARF info including the symbol.
However, generating DWARF from scratch seems to be a really uncommon use case, so the only working approach I've found so far is to use objcopy to add the symbol to the binary itself:
objcopy a.out --add-symbol function_name=.text:0x900,function,global a.out2
Note that gdb doesn't like absolute symbols for functions; I had to specify the symbol as an offset into the .text section for it to be useful (i.e., to be able to set breakpoints on the function and have it appear in backtraces).
Also, I wasn't able to find any way to modify the "size" field of the symbol.
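If there are many functions to name, typing one objcopy option per symbol gets tedious. Below is a minimal Python sketch that builds the --add-symbol arguments from a table of names and absolute addresses, converting each address into an offset into .text as described above. The function names, the addresses, and the text_start value are hypothetical; the real start address of .text would have to be read from the section headers (for example with objdump -h or readelf -S).

```python
def add_symbol_args(functions, text_start):
    """Build objcopy --add-symbol arguments.

    functions:  dict mapping a chosen name to its absolute address
    text_start: address at which the .text section is loaded
    """
    args = []
    # Sort by address so the generated command line is stable and readable.
    for name, address in sorted(functions.items(), key=lambda item: item[1]):
        offset = address - text_start
        if offset < 0:
            raise ValueError(f"{name} lies before the start of .text")
        # Section-relative symbol, flagged as a global function.
        args += ["--add-symbol", f"{name}=.text:{offset:#x},function,global"]
    return args


def objcopy_command(src, dst, functions, text_start):
    """Return the full objcopy command as an argument list."""
    return ["objcopy", src, *add_symbol_args(functions, text_start), dst]


if __name__ == "__main__":
    # Hypothetical names and addresses picked while reading the disassembly.
    names = {"parse_header": 0x8048900, "checksum": 0x8048A40}
    print(" ".join(objcopy_command("a.out", "a.out2", names, text_start=0x8048000)))
```

The returned list can be passed to subprocess.run as is; the resulting a.out2 is then loaded into gdb in place of the stripped binary.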
I wouldn't look for a solution in gdb. I would instead try to figure out how to put the symbols back into the binary. Logically, if it is possible to strip the symbols, then it must be possible to add them back. I'd expect the linker (ld) or some other tool to allow that.
I recommend checking all the tools in the binutils package (objdump, objcopy, nm, ld, ...) - they are capable of many almost miraculous things!
Tomas
I am brand new to C++, trying to create a program to read pixels on the screen on Linux.
I currently compile the project without any optimization flag, as I am unsure what it does to the program, but that would be another question, here's mine:
Is stripping certain information from a C++ binary safe?
I found a possibly helpful manual page for the strip program.
As I don't really know what stripping means in this context, I am unsure if it is as simple as stripping all of it with:
-s --strip-all Remove all symbol and relocation information
But, of course, I'd want the program to work flawlessly afterwards, so does stripping interfere in any way with the program's execution?
As for my motivation for stripping: I want to know if it's safe, and as I said already, I repeat:
I don't really know what stripping means in this context.
I thought the answerer could have also covered this. For me to decide.
Symbols are used for debugging.
Your application would continue to work without issues if you strip them, but you may find it harder to debug if there's a problem.
Relocation information is used for dynamic library loading and for address space layout randomisation (thank you #interjay); and from the strip documentation:
--remove-relocations=sectionpattern
... Note that using this option inappropriately may make the output file unusable. ...
My project uses template metaprogramming heavily. Most of the action happens inside recursive templates which produce objects and functions with very long (mangled) symbol names.
Despite the build time being only ~30 sec, the resulting executable is about a megabyte, and it's mostly symbol names.
On Linux, adding a -s argument to GCC brings the size down to ~300 KiB, but a quick look with a text editor shows there are still a lot of cumbersome names in there. I can't find how to strip anything properly on OS X… will just write that off for now.
I suspect that the vtable entries for providing typeid(x).name() are taking up a big chunk. Removing all use of the typeid operator did not cause anything more to be stripped on Linux. I think that the default exception handler uses the facility to report the type of an uncaught exception.
How might I maximize strippage and minimize these kilobyte-sized symbols in my executable?
Just run the program strip on the final executable. If you want to be fancier, you can use some other tools to store the debug info separately, but for your stated purpose, just strip a.out is fine. Maybe use the --strip-all option--I haven't tried that myself to see if it differs from the default behavior.
If you really want to try disabling RTTI, well, it's gcc -fno-rtti. But that may break your program badly--only one way to find out I guess.
Some days ago I accidentally opened a C++ executable of a commercial application in Notepad++ and found out that there's quite a lot information about the original source code stored in the executable.
Inside the executable I could find file names (app.c, dlgstat.c, ...), function names (GetTickCount, DispatchMessageA, ...) and small pieces of source code, mostly conditions (szChar != TEXT('\0'), iRow < XTGetRows( hwndList )). After that I checked another Qt executable and, yes, again found source file names and method signatures.
Because of that I am wondering how much source code information is really stored in a C/C++ executable (e.g., compiled using Qt or MinGW). Is this probably some kind of debug build still containing the original source? Is this information used for some reflection stuff? Is there any reason why publishers don't remove this stuff?
How much source code information is really stored in a C/C++ executable?
In practice, not much. The source code is not required at runtime. The strings you name come from two things:
The function names (e.g. GetTickCount) are the names of functions imported from other modules. The names are required at runtime because the functions are resolved dynamically (by calling GetProcAddress with the function name).
The conditions are likely assertions: the assert macro stringizes its argument so that when it fires you know what condition was not met.
If you build a DLL, it will also contain the names of all of the functions it exports, so they can be resolved at runtime (the same is likely true for other shared object formats).
Debug symbols may also contain some of the original source code, though it depends on the format used by the debug symbols. These symbols may be contained either in the binary itself or in an auxiliary file (for example, .pdb files used on Windows).
Windows function names: they probably are there just because they are being accessed dynamically - somewhere in your program there's a GetProcAddress to get their address. Still, no reason to worry, every application uses WinAPIs, so there's not much to discover about your executable from that information.
Conditions: probably from some assert-like macro; they are included so that assert can print which condition failed. Anyhow, in release mode (with NDEBUG defined) assertions should be removed automatically.
Source file names and method signatures: probably from some usage of the __FILE__ macro and the __func__ identifier; probably, again, from assert.
Another source of information about the inner structure of your program is RTTI, which has to provide some representation for every type that typeid could be working on. If you don't need its functionality, you can disable it (but I don't know if that is possible in Qt projects).
Mixed into the binary of a C++ app you will find the names of most global symbols (and debugging symbols if enabled in the compiler), but with extra 'decoration text' that encodes the calling signature of the symbol if it is a function or method. Likewise, the literals of character strings are embedded in clear text. But nowhere will you find anything like the actual source code that the compiler used to create the binary executable. That information is lost during the compilation process, and it is especially hard to reverse engineer if C++ templates are employed in the build.
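To see how much clear text an executable really carries, without squinting at it in a text editor, one can pull out the printable runs, roughly what the strings tool from binutils does. The following is only a rough Python sketch under the assumption that plain ASCII is enough; the sample bytes are made up from the strings quoted in the question.

```python
import re
import sys


def printable_strings(data, min_length=4):
    """Return every run of printable ASCII in data that is at least min_length long."""
    # 0x20-0x7e covers space through '~'; anything else ends a run.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Scan a real file given on the command line.
        with open(sys.argv[1], "rb") as handle:
            blob = handle.read()
    else:
        # Made-up sample: an imported function name and an assertion condition.
        blob = b"\x00\x01GetTickCount\x00\x7fiRow < XTGetRows( hwndList )\x00"
    for text in printable_strings(blob):
        print(text)
```

Imported function names, assertion conditions and __FILE__ strings show up this way; running it before and after strip gives a quick feel for what stripping removes and what it cannot.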
A little background: I'm trying to build an AVR binary for an embedded sensor system, and I'm running close to my size limit. I use a few external libraries to help me, but they are rather large when compiled into one object per library. I want to pull these into smaller objects so only the functionality I need is linked into my program. I've already managed to drop the binary size by 2k by splitting up a large library.
It would help a lot to know which objects are being used at each stage of the game so I can split them more efficiently. Is there a way to make ld print which objects it's linking?
I'm not sure about the object level, but I believe you might be able to tackle this on the symbol level using CFLAGS="-fdata-sections -ffunction-sections" and LDFLAGS="-Wl,--gc-sections -Wl,--print-gc-sections". This should get rid of the code for all unreferenced symbols, and display the removed symbols to you as well which might be useful if for some reason you decide to go back to the object file level and want to identify object files only containing removed symbols.
To be more precise, the compiler flags I quoted will ask the compiler to place each function or global variable in a section of its own, and the --gc-sections linker flag will then remove all the sections which have not been used. It might be that each object file contains its own sections, even if the functions therein all share a single section. In that case the linker flag alone should do what you ask for: eliminate whole objects which are not used. The gcc manual states that the compiler flags will increase the object size, and although I hope that the final executable should not be affected by this, I don't know for sure, so you should give LDFLAGS="-Wl,--gc-sections" by itself a try in any case.
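To get from the --print-gc-sections output back to the object file level, the removed sections can be grouped by the file they came from. Here is a small Python sketch; it assumes the linker reports each removal as a line of the form "removing unused section '<section>' in file '<object>'", which may vary between ld versions, and the object and section names in the demo are invented.

```python
import re
from collections import defaultdict

# Assumed message format; matched case-insensitively because the
# capitalisation of "removing" may differ between ld versions.
GC_LINE = re.compile(
    r"removing unused section '([^']+)' in file '([^']+)'", re.IGNORECASE
)


def removed_sections_by_object(linker_output):
    """Map each object file to the list of sections the linker discarded from it."""
    removed = defaultdict(list)
    for line in linker_output.splitlines():
        match = GC_LINE.search(line)
        if match:
            section, obj = match.groups()
            removed[obj].append(section)
    return dict(removed)


if __name__ == "__main__":
    # Invented sample of what the linker might print.
    sample = (
        "ld: removing unused section '.text.uart_init' in file 'libdrv.a(uart.o)'\n"
        "ld: removing unused section '.text.uart_send' in file 'libdrv.a(uart.o)'\n"
        "ld: removing unused section '.data.lut' in file 'math.o'\n"
    )
    for obj, sections in removed_sections_by_object(sample).items():
        print(f"{obj}: {len(sections)} section(s) removed")
```

Objects that lose many sections are the ones where splitting the library further is most likely to pay off.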
The listed option names might be useful keywords to search on stackoverflow for other suggestions on how to reduce the size of the binary. gc-sections e.g. yields 62 matches at the moment.
glibc's backtrace_symbols() only resolves dynamic symbols, since handling all types of symbols is something the maintainers do not want to get into.
How would I go about extracting non-dynamic symbols from the addresses returned by the backtrace() function myself?
Check out what addr2line does using bfd. That is one approach I have used successfully.
More specifically, backtracefilt gets you basically all the way there, you just need to adapt it to take the addresses from backtrace instead of parsing a file.
libdw, part of elfutils, can be used to read the DWARF debugging information that is present if you compiled with -g.
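A different and much cruder route than the in-process approaches above is to post-process the text that backtrace_symbols() produces and hand the addresses to the addr2line command afterwards. The Python sketch below assumes frames look like "module(symbol+0xoff) [0xaddr]" or "module [0xaddr]" and that the absolute address is meaningful to addr2line, which is an assumption that holds for a non-PIE executable but not for shared libraries or PIE binaries, where the load base would have to be subtracted first. The sample frames are illustrative.

```python
import re

# module, optional "(symbol+0xoffset)" part, then the absolute address in brackets.
FRAME = re.compile(
    r"^(?P<module>[^(\[\s]+)(?:\((?P<symbol>[^)]*)\))?\s*\[(?P<addr>0x[0-9a-fA-F]+)\]"
)


def parse_frame(line):
    """Return (module, address) for one backtrace_symbols() line, or None."""
    match = FRAME.match(line.strip())
    if not match:
        return None
    return match.group("module"), match.group("addr")


def addr2line_commands(lines):
    """Group addresses by module and build one addr2line invocation per module."""
    by_module = {}
    for line in lines:
        parsed = parse_frame(line)
        if parsed:
            module, addr = parsed
            by_module.setdefault(module, []).append(addr)
    # -f prints function names, -C demangles C++ names, -e selects the binary.
    return [
        ["addr2line", "-f", "-C", "-e", module, *addrs]
        for module, addrs in by_module.items()
    ]


if __name__ == "__main__":
    # Illustrative frames; the second one has no dynamic symbol attached.
    frames = [
        "./prog(myfunc3+0x5c) [0x80487f0]",
        "./prog [0x8048871]",
    ]
    for command in addr2line_commands(frames):
        print(" ".join(command))
```

This only resolves names if the binary passed to -e still has its symbols or debug info, for example an unstripped copy kept next to the shipped one.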