I used objdump -t on the debug-info file of a program to find the address range of each function. The bounds of a few functions cannot be determined this way, because objdump reports 0 for their sizes. These symbols are shown below:
deregister_tm_clones 0000000000197ce0
register_tm_clones 0000000000197d20
__do_global_dtors_aux 0000000000197d70
frame_dummy 0000000000197db0
_fini 00000000004e9474
_init 00000000001889e8
How can I determine their sizes? The only approach I can think of is to use GDB's disas command on the start address and find the end of the disassembly for the function, but this may not work in all cases. What is the standard approach?
UPDATE:
I am implementing a Pintool to generate callstacks at runtime. I only need symbols in certain binaries. In other words, I need a subset of functions (e.g., those in the GTK library) to be included in the callstack. Therefore, at runtime, I will need the ranges for these libraries.
In addition, I need the range of each symbol to find the jumps that leave it. Such a jump is a sign of tail-call elimination, which requires updating the call stack.
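One heuristic that is sometimes used when a symbol's size field is 0 is to treat the start of the next symbol in the same section as an upper bound on its end. The sketch below assumes the (name, address, size) tuples were already parsed from the objdump -t output for a single section. The function name estimate_ranges and the section_end value 0x197dc0 are made up for illustration, and padding between functions means the result may overestimate.

```python
def estimate_ranges(symbols, section_end):
    """Return {name: (start, end)}; zero-sized symbols extend to the next start."""
    ordered = sorted(symbols, key=lambda s: s[1])
    ranges = {}
    for i, (name, start, size) in enumerate(ordered):
        if size:
            end = start + size
        else:
            # Use the next symbol at a strictly higher address as the bound,
            # or the end of the section if there is none.
            end = next((s[1] for s in ordered[i + 1:] if s[1] > start), section_end)
        ranges[name] = (start, end)
    return ranges

symbols = [
    ("deregister_tm_clones", 0x197ce0, 0),
    ("register_tm_clones", 0x197d20, 0),
    ("__do_global_dtors_aux", 0x197d70, 0),
    ("frame_dummy", 0x197db0, 0),
]
# section_end is a hypothetical value, not taken from the real binary.
ranges = estimate_ranges(symbols, section_end=0x197dc0)
```

This is only an approximation, not the standard approach the question asks for.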
I have an MCU (say an STM32) running, and I would like to 'pass' it a separately compiled binary file over UART/USB and use it like a function call, where I can pass it data and collect its output. After it completes, a second, different binary would be sent to be executed, and so on.
How can I do this? Does this require an OS be running? I'd like to avoid that overhead.
Thanks!
The exact call mechanism is somewhat specific to the MCU, but you are just making a function call. You can try the function pointer approach, but that has been known to fail with Thumb on gcc (the STM32 uses the Thumb instruction set from ARM).
First off, you need to decide in your overall system design whether you want to use a specific address for this code, for example 0x20001000, or whether you want to have several of these resident at the same time and load them at any one of multiple possible addresses. This will determine how you link this code. Is this code standalone, with its own variables, or does it need to know how to call functions in other code? All of this determines how you build this code. The easiest, at least to first try this out, is a fixed address. Build like you build your normal application, but based at a RAM address like 0x20001000. Then you load the program sent to you at that address.
In any case, the normal way to "call" a function in Thumb (say on an STM32) is the bl or blx instruction. In this situation you would normally use bx, but to make it a call you need a return address. The way ARM/Thumb works is that for bx and related instructions the lsbit of the target address determines the mode you switch to or stay in when branching: lsbit set is Thumb, lsbit clear is ARM. This is all documented in the ARM documentation, which completely covers your question BTW, not sure why you are asking...
Gcc, and I assume llvm, struggle to get this right, and then some users know just enough to be dangerous and do the worst thing: ADDing one (rather than ORRing one), or even hard-coding the one into the address. Sometimes putting the one there helps the compiler (this is if you try the function pointer approach and hope the compiler does all the work for you, the *myfun = 0x10000 kind of thing). But it has been shown on this site that subtle changes to the code, or the exact situation, decide whether the compiler gets it right or wrong, and without looking at the generated code you cannot tell whether you have to help with the orr-one thing. As with most things, when you need an exact instruction, just write it in asm yourself (real asm, not inline, please); it makes your life 10000 times easier and your code significantly more reliable.
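To see why ORRing the one in is safer than ADDing it, here is a small arithmetic sketch in Python (not target code). The helper names are made up for illustration. It shows what happens to the lsbit when the address already has it set.

```python
THUMB_BIT = 1

def with_thumb_bit_or(addr):
    # OR is idempotent: the lsbit ends up set whether or not it already was.
    return addr | THUMB_BIT

def with_thumb_bit_add(addr):
    # ADD is not: if the lsbit is already set, adding one clears it
    # and moves the target to the next even address.
    return addr + 1

base = 0x20001000
already_thumb = base | 1  # e.g. a symbol address the toolchain already marked as Thumb

print(hex(with_thumb_bit_or(base)))            # 0x20001001
print(hex(with_thumb_bit_or(already_thumb)))   # 0x20001001
print(hex(with_thumb_bit_add(already_thumb)))  # 0x20001002
```

With the add variant the branch target ends up with the lsbit clear, which is exactly the failure mode described above.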
So here is my trivial solution, extremely reliable, port the asm to your assembly language.
.thumb
.thumb_func
.globl HOP
HOP:
bx r0
In C it looks like this
void HOP ( unsigned int );
Now if you loaded the code at address 0x20001000, then after loading it there you call
HOP(0x20001000|1);
Or you can
.thumb
.thumb_func
.globl HOP
HOP:
orr r0,#1
bx r0
Then
HOP(0x20001000);
The compiler generates a bl to hop which means the return path is covered.
If you want to send say a parameter...
.thumb
.thumb_func
.globl HOP
HOP:
orr r1,#1
bx r1
void HOP ( unsigned int, unsigned int );
HOP(myparameter,0x20001000);
Easy and extremely reliable, compiler cannot mess this up.
If you need to share functions and global variables between the main app and the downloaded app, there are a few solutions, and they all involve resolving addresses. If the loaded app and the main app are not linked at the same time (doing a copy-and-jump with a single link is generally painful and should be avoided, but...), then, like any shared library, you need a mechanism for resolving addresses. If the downloaded code has several functions and global variables, and/or your main app has several functions and global variables that the downloaded library needs, then you have to solve this.
Essentially one side has to have a table of addresses in a format that both sides agree on. It could be a simple array of addresses where both sides know which address is which simply from position. Or you create a list of addresses with labels, and then search through the list matching names to addresses for all the things you need to resolve.
You could, for example, use the above to have a setup function that you pass an array/structure to (sharing structures across compile domains is of course a very bad thing). That function then sets up all the local function pointers and variable pointers into the main app, so that subsequent functions in the downloaded library can call the functions in the main app. And/or, vice versa, this first function can pass back an array/structure of all the things in the library.
Alternatively, at a known offset in the downloaded library there could be an array/structure, for example in the first words/bytes of that downloaded library. By providing one or the other or both, the main app can find all the function addresses and variables, and/or the downloaded code can be given the main application's function addresses and variables, so that when one calls the other it all works. This of course means function pointers and variable pointers in both directions. Think about how .so files or .dlls work in Linux or Windows; you have to replicate that yourself.
Or you go the path of linking at the same time. The downloaded code then has to have been built along with the code being run, which is probably not desirable, but some folks do this, or they do it to load code from flash to RAM for various reasons. It is a way to resolve all the addresses at build time: you extract part of the binary separately from the final build and pass it around later.
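As a host-side illustration of the "table at a known offset" idea, the sketch below reads a positional table of 32-bit little-endian addresses from the first words of a downloaded image. The layout (two entries, setup first and run second) and the addresses are assumptions made up for this example, not a format defined anywhere in this answer.

```python
import struct

# Hypothetical convention that both sides agree on: position 0 = setup, 1 = run.
SETUP, RUN = 0, 1

def read_entry_table(image, count):
    """Read `count` little-endian 32-bit addresses from the start of the image."""
    return list(struct.unpack_from("<%dI" % count, image, 0))

# Fake image: a two-entry table (Thumb bit already set) followed by padding
# standing in for the code itself.
image = struct.pack("<2I", 0x20001009, 0x20001021) + b"\x00" * 16
table = read_entry_table(image, 2)

print(hex(table[SETUP]), hex(table[RUN]))  # 0x20001009 0x20001021
```

On the MCU the same lookup is just indexing an array of words at the load address; a name-based table would add a string comparison per entry.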
If you do not want a fixed address, then you need to build the downloaded binary as position independent, and you should link that with .text and .bss and .data at the same address.
MEMORY
{
hello : ORIGIN = 0x20001000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > hello
.rodata : { *(.rodata*) } > hello
.bss : { *(.bss*) } > hello
.data : { *(.data*) } > hello
}
You should obviously do this anyway, but with position-independent code you then have it all packed in along with the GOT (it might need a .got entry, but I think it knows to use .data). Note that, with GNU tools at least, if you put .data after .bss and make sure you have at least one .data item (even a bogus variable you do not use), then .bss is zero-padded and allocated for you in the binary, so there is no need to set it up in a bootstrap.
If you build for position independence then you can load it almost anywhere, clearly on arm/thumb at least on a word boundary.
In general, for other instruction sets the function pointer approach works just fine. In ALL cases you simply look at the documentation for the processor, see the instruction(s) used for calling and returning or branching, and use those, be it by having the compiler do it or by forcing the right instruction so that it does not fail down the road after a re-compile (leaving you with a very painful debug). ARM and MIPS have 16-bit modes that require specific instructions or solutions for switching modes. x86 has different modes, 32-bit and 64-bit, and ways to switch between them, but normally you do not need to mess with this for something like this. On msp430, pic, and avr, a plain function pointer in C should work fine. In general, do the function pointer thing, then see what the compiler generates and compare that to the processor documentation (and compare it to a non-function-pointer call).
If you do not know these basic concepts (function pointers in C, linking a bare-metal app on an mcu/processor, bootstrap, .text, .data, etc.), you need to go learn all that.
The times you decide to switch to an operating system are when you need a filesystem, networking, or a few things like this that you just do not want to do yourself. Sure, there is lwip for networking, and there are some embedded filesystem libraries. Multithreading is another reason for an OS. But if all you want to do is generate a branch/jump/call instruction, you do not need an operating system for that. Just generate the call/branch/whatever.
Loading and executing a fully linked binary and loading and calling a single function (and returning to the caller) are not really the same thing. The latter is somewhat complicated and involves "dynamic linking", where the code effectively executes in the same execution environment as the caller.
Loading a complete stand-alone executable, on the other hand, is more straightforward and is the function of a bootloader. A bootloader loads and jumps to the loaded executable, which then establishes its own execution environment. Returning to the bootloader requires a processor reset.
In this case it would make sense to have the bootloader load and execute code in RAM if you are going to be frequently loading different code. However be aware that on Harvard Architecture devices like STM32, RAM execution may slow down execution because data and instruction fetch share the same bus.
The actual implementation of a bootloader will depend on the target architecture, but for Cortex-M devices is fairly straightforward and dealt with elsewhere.
STM32 actually includes an on-chip bootloader (you need to configure the boot source pins to invoke it), which I believe can load and execute code in RAM. It is normally used to load a secondary bootloader to load and program flash, but it can be used for loading any code.
You do need to build and link your code to run from RAM at the address where the loader locates it or, if supported, build position-independent code that can run from anywhere.
The line profiling output of google-pprof claims that most of the running time of my numerical C++ program is being spent in a function called __nss_database_lookup (see below). Apparently that function is for handling things like the passwd file on UNIX systems. My C++ program should only be doing numerical calculations, allocating memory, and passing a few custom C++ data types around.
What's going on? Is the appearance of that function a mirage, a mere artefact of how google-pprof works? Or is it actually being called and wasting two thirds of my program's running time? If it is being called, what could be calling it? Has something mistakenly called it in one of my C++ classes? How would I track that down?
I'm using Ubuntu 20.04, g++-7 and g++-9.
Total: 1046 samples
665 63.6% 63.6% 665 63.6% __nss_database_lookup ??:0
107 10.2% 73.8% 193 18.5% <function1> file.h:1035
92 8.8% 82.6% 92 8.8% <function2> file.h:...
87 8.3% 90.9% 87 8.3% <function3> file.h:995
17 1.6% 92.5% 734 70.2% <function4> file.h:1128
...
(Function and file names obscured for confidentiality reasons)
A friend of mine ran into a similar issue today. It has been a while since you asked the question, but I would still like to answer it so that anyone else who ends up here can get some hints.
This happens because some local symbols (which correspond to static functions in C/C++) are being called. These symbols have no entries in the symbol table, and their text (code) is placed after __nss_database_lookup, so your perf tool treats them as part of __nss_database_lookup.
For example, your program may call memcpy, and memcpy calls __memmove_unaligned_avx_erms, which is a local symbol in glibc and isn't exported in the dynamic symbol table. Its code happens to be placed after __nss_database_lookup, together with other local symbols. Your perf tool can find nothing about __memmove_unaligned_avx_erms, so it just assumes __nss_database_lookup is being called.
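The misattribution can be reproduced with a toy symbolizer that, like many tools, maps a sampled address to the nearest preceding symbol it knows about. The addresses and the name exported_a below are invented for illustration; only the lookup behaviour matters.

```python
import bisect

def naive_symbolize(exported, pc):
    """Map pc to the closest preceding symbol; exported is sorted by address."""
    starts = [start for start, _ in exported]
    i = bisect.bisect_right(starts, pc) - 1
    return exported[i][1] if i >= 0 else None

# Only exported symbols are visible. A local function living at 0x2400
# (hypothetical address) has no entry in this table.
exported = [(0x1000, "exported_a"), (0x2000, "__nss_database_lookup")]

# A sample taken inside the hidden local function is blamed on the last
# visible symbol before it.
print(naive_symbolize(exported, 0x2410))  # __nss_database_lookup
```

Once the local symbol is added to the table (which is what the debug info provides), the same lookup returns the right name.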
A potential solution is to install the libc-dbg package (the package name may vary between distros). If your perf tool is smart enough to load the debug info automatically, it may then annotate symbols correctly. (My friend checked that this did have an effect with the perf tool.)
The mechanism that allows gdb to perform backtrace 1 is well explained.
Starting from the current frame, look at the return address
Look for a function whose code section contains that address.
Theoretically, there might be hundreds of thousands of functions to consider.
I was wondering if there are any inherent limitations that prevent gdb
from creating a lookup table with return address -> function name.
What makes you think GDB does a straight search through all functions? This isn't what happens. GDB organises symbols into a couple of different data structures that allow for more efficient mapping between addresses and the enclosing function.
A good place to start might be here: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/blockframe.c;h=d9c28e0a0176a1d91fec1df089fdc4aa382e8672;hb=HEAD#l118
The mechanism that allows gdb to perform backtrace 1 is well explained.
This isn't at all how GDB performs a backtrace.
The address stored in the rip register points to the current instruction, and has nothing to do with return address.
The return address is stored on the stack, or possibly in another register. To find where it is stored (on x86_64, and assuming e.g. Linux/ELF/DWARF file format), GDB looks up unwind descriptor that covers the current value of RIP. The unwind descriptor also tells GDB how to restore other registers to the state they were just before the current function was called.
You can see unwind descriptors with e.g. readelf -wf a.out command.
Once GDB knows how to find return address and restore registers, it can effectively perform an up command, stepping from current (called) frame into previous (caller) frame.
Now this process repeats, until either GDB finds a special unwind descriptor which says "I am the last, don't try to unwind past me", or some error occurs (e.g. restored RIP is 0).
Notably, nowhere in this process does GDB have to consider thousands of functions.
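The loop described above can be modelled in a few lines. This is a heavily simplified sketch, not how GDB or DWARF CFI actually encode things: each hypothetical descriptor is just (start, end, ra_offset), meaning "for a PC in this range, the return address is at SP + ra_offset", and the stack is a dictionary of invented values.

```python
def find_descriptor(table, pc):
    # Linear scan for brevity; a real unwinder would use a sorted or indexed table.
    for start, end, ra_offset in table:
        if start <= pc < end:
            return ra_offset
    return None

def backtrace(table, stack, pc, sp, word=8):
    """Return the list of PCs, innermost frame first."""
    frames = [pc]
    while True:
        ra_offset = find_descriptor(table, pc)
        if ra_offset is None:
            break                      # no unwind info: stop
        pc = stack.get(sp + ra_offset, 0)
        if pc == 0:
            break                      # restored PC is 0: outermost frame
        sp = sp + ra_offset + word     # pop the frame and the return address
        frames.append(pc)
    return frames

# Hypothetical unwind table and stack contents.
table = [(0x1000, 0x1100, 16), (0x2000, 0x2100, 0)]
stack = {0x7010: 0x2050, 0x7018: 0}
print([hex(f) for f in backtrace(table, stack, pc=0x1010, sp=0x7000)])
# ['0x1010', '0x2050']
```

Each step looks up only the descriptor covering the current PC, which is the point made above.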
I have a simple hello world program, and after I run dumpbin on it with the /headers flag, I get this output:
FILE HEADER VALUES
8664 machine (x64)
D number of sections
5A3D287F time date stamp Fri Dec 22 18:45:03 2017
48F file pointer to symbol table
2D number of symbols
0 size of optional header
0 characteristics
Summary
F .data
A0 .debug$S
2F .drectve
24 .pdata
B9 .text$mn
18 .xdata
What exactly does the .xdata section do, and what does it contain? I found no info on MSDN.
For future reference:
.text: codesegment (think functions); there can be multiple of those when enabling function sections or when comdat is involved (for example templates)
.data: datasegment (think global vars); there can be multiple of those when enabling data sections or when comdat is involved (for example templates)
.bss: datasegment initialized to zeros (not present above); there can be multiple of those when enabling data sections or when comdat is involved (for example templates)
.debug: Debug info; like others, there can be multiple of these when function sections are involved.
.pdata: for x86_64, this is the "exception info" for a method, it defines the start/end of a function, and a pointer to the unwind info (see .xdata); inside object files this is duplicated per function
.drectve: not sure; but from the name I'd guess linker directives.
.xdata: for x86_64; this is the unwind info part that pdata points to. It contains where the exception handler of a function is, and what to do to unwind it when an exception occurs: https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019
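To make the .pdata/.xdata relationship concrete: on x64 each .pdata entry is a RUNTIME_FUNCTION record of three 32-bit values (begin address, end address, unwind info address), so the 0x24-byte .pdata above has room for three entries. The sketch below decodes such records from raw bytes. The sample values are invented, and in an unlinked object file the fields are filled in through relocations, so this is most meaningful on a linked image.

```python
import struct

ENTRY_SIZE = 12  # three little-endian 32-bit fields per record

def parse_pdata(data):
    """Return a list of (begin_rva, end_rva, unwind_info_rva) tuples."""
    usable = len(data) - len(data) % ENTRY_SIZE
    return [struct.unpack_from("<3I", data, off)
            for off in range(0, usable, ENTRY_SIZE)]

# Hypothetical single entry: a function at RVA 0x1000..0x10B9 whose
# unwind info lives at RVA 0x3000 (inside .xdata).
sample = struct.pack("<3I", 0x1000, 0x10B9, 0x3000)
entries = parse_pdata(sample)
print([tuple(hex(v) for v in e) for e in entries])
```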
The "$" postfix is used for sorting. Given:
- .sec$z
- .sec$data
- .sec$a
The sections are sorted before they are merged into an executable (so .sec$a first, then .sec$data, then .sec$z). This can be used to create start/end symbols for a PE section.
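A tiny model of that ordering, assuming only that contributions to the same base section are ordered by the text after the "$" (the function name merge_order is made up, and the real linker does more than this):

```python
def merge_order(section_names):
    """Order section contributions by base name, then by the suffix after '$'."""
    def key(name):
        base, _, suffix = name.partition("$")
        return (base, suffix)
    return sorted(section_names, key=key)

order = merge_order([".sec$z", ".sec$data", ".sec$a"])
print(order)  # ['.sec$a', '.sec$data', '.sec$z']
```

A variable placed in .sec$a therefore ends up before everything in .sec$data, and one in .sec$z after it, which is how the start/end marker trick works.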
The repeated sections are for things like c++ templates, the compiler will instantiate a template in any translation unit that needs it and then the linker will pick one of those instantiations (usually the first encountered).
Less common are compiler-specific features like Microsoft's __declspec(selectany) that allow a variable to be defined more than once and again the linker will simply pick one of those definitions and discard the rest.
gcc's ld scripts will take all the .text* sections to create the final .text of the linked executable. You can examine those scripts to get an idea of how the linker creates an executable out of object files.
I'm looking for a nice Stack Overflow-style answer to the first question in the old blog post C++ Code Size, which I'll repeat below:
I’d really like some tool (ideally, g++ based) that shows me what parts of compiled/linked code are generated from what parts of C++ source code. For instance, to see whether a particular template is being instantiated for hundreds of different types (fixable via a template specialization) or whether code is being inlined excessively, or whether particular functions are larger than expected.
If you're looking to find sources of code bloat in your C++ code, I've used 'nm' for that. The following command will list all the symbols in your app with the biggest code and data chunks at the top:
nm --demangle --print-size --size-sort --reverse-sort <executable_or_lib_name> | less
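If you would rather post-process the output than page through it, here is a minimal sketch that parses nm --print-size style lines (address, size, type, name) and returns the largest symbols. The sample text is invented, and the parser assumes the address and size columns have the same width, which is how the GNU nm output I have seen looks.

```python
def largest_symbols(nm_output, top=10):
    """Return [(size, type, name), ...] sorted by size, largest first."""
    rows = []
    for line in nm_output.splitlines():
        parts = line.split(None, 3)   # the demangled name may contain spaces
        if len(parts) != 4:
            continue
        addr, size, kind, name = parts
        # Skip lines without a size column (the type letter would land in `size`).
        if len(kind) != 1 or len(size) != len(addr):
            continue
        try:
            rows.append((int(size, 16), kind, name))
        except ValueError:
            continue
    rows.sort(key=lambda r: r[0], reverse=True)
    return rows[:top]

# Hypothetical nm output, for illustration only.
sample = """\
0000000000001139 0000000000000017 T main
0000000000001150 00000000000000a4 W std::vector<int>::push_back(int const&)
0000000000004010 0000000000000004 B counter
"""
for size, kind, name in largest_symbols(sample):
    print(size, kind, name)
```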
It does seem like something like this should exist, but I haven't used anything like it. I can tell you how I'd go about scripting this together, though. There are probably swifter and/or sexier ways to do it.
First some stuff that you may already know:
The addr2line command takes in an address and can tell you which source file and line the machine code at that address came from. The executable needs to be built with debugging symbols, and you'll probably not want to optimize it much (-O0, -O1, or -Os is probably as high as you'd want to go at first anyway). addr2line has several flags, and you'll want to read its manual page, but you will definitely need to use -C or --demangle if you want to see C++ function names that make sense in the output.
The objdump command can print out all kinds of interesting things about the stuff in many types of object files. One of the things it can do is print out a table representing the symbols in or referred to by an object file (including executables).
Now, what you want to do with that:
What you'll want is for objdump to tell you the address and size of the .text section. This is where actual executable machine code lives. There are several ways to do this, but the easiest (for this, anyway) is probably for you to do:
objdump -h my_exe | grep text
That should result in something like:
12 .text 0000049 000000f000 0000000f000 00000400 2**4
If you didn't grep it it would give you a heading like:
Idx Name Size VMA LMA File off Algn
I think for executables the VMA and LMA should be the same, so it won't matter which you use, but I think LMA is the best. You'll also want the size.
With the LMA and size you can repeatedly call addr2line asking for the source code origin of the machine code. I'm not sure how this would work if you passed an address that was within one instruction, but I think it should work.
addr2line -e my_exe <address>
The output from this will be a path/filename, a colon, and a line number.
If you were to count the occurrence of each unique path/file:num you should be able to look at the ones that have the highest counts.
Perl hashes using the path/file:num as the key and a counter as the value would be an easy way to implement this, though there are faster ways if you find that runs too slow.
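The same counting idea, sketched in Python with collections.Counter instead of a Perl hash. It assumes you have already collected the addr2line output lines (one path/file:num per address); the sample lines are invented, and unresolved "??:0" entries are dropped.

```python
from collections import Counter

def hot_lines(addr2line_lines, top=5):
    """Count how often each file:line appears and return the most frequent."""
    counts = Counter(
        line.strip()
        for line in addr2line_lines
        if line.strip() and not line.startswith("??")
    )
    return counts.most_common(top)

# Hypothetical addr2line output, one line per queried address.
sample = ["vec.h:120", "main.cpp:10", "vec.h:120",
          "vec.h:120", "main.cpp:10", "??:0"]
print(hot_lines(sample, top=2))  # [('vec.h:120', 3), ('main.cpp:10', 2)]
```

Source lines with the highest counts account for the most addresses in .text, which is a rough proxy for where the code size goes.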
You could also filter out things that you can determine don't need to be included early.
For displaying your output you may want to filter out different lines from the same function, but you may notice that different lines within one function have different counts, which could be interesting. Anyway, that could be done either by making addr2line tell you the function name or by using objdump -t in the first step and working one function at a time.
If you see that some template code or other code lines are showing up in your executables more often than you think they should, then you can easily locate them and have a closer look. Macros and inline functions may end up manifesting themselves differently than you expect.
If you didn't know, objdump and addr2line are from the GNU binutils package, which includes several other useful tools.
I recently wrote a tool, bloat-blame, which does something similar to what nategoose proposed.
In most C compilers there is a way to generate a .map file. This file lists all of the compiled libraries along with their addresses and sizes. You can use that map file to help you determine which files you should be looking to optimize first.
You can check out bloaty for analyzing the binary size of your program:
https://github.com/google/bloaty
./bloaty bloaty -d compileunits
FILE SIZE VM SIZE
-------------- --------------
34.8% 10.2Mi 43.4% 2.91Mi [163 Others]
17.2% 5.08Mi 4.3% 295Ki third_party/protobuf/src/google/protobuf/descriptor.cc
7.3% 2.14Mi 2.6% 179Ki third_party/protobuf/src/google/protobuf/descriptor.pb.cc
4.6% 1.36Mi 1.1% 78.4Ki third_party/protobuf/src/google/protobuf/text_format.cc
3.7% 1.10Mi 4.5% 311Ki third_party/capstone/arch/ARM/ARMDisassembler.c
1.3% 399Ki 15.9% 1.07Mi third_party/capstone/arch/M68K/M68KDisassembler.c
3.2% 980Ki 1.1% 75.3Ki third_party/protobuf/src/google/protobuf/generated_message_reflection.cc
3.2% 965Ki 0.6% 40.7Ki third_party/protobuf/src/google/protobuf/descriptor_database.cc
2.8% 854Ki 12.0% 819Ki third_party/capstone/arch/X86/X86Mapping.c
2.8% 846Ki 1.0% 66.4Ki third_party/protobuf/src/google/protobuf/extension_set.cc
2.7% 800Ki 0.6% 41.2Ki third_party/protobuf/src/google/protobuf/generated_message_util.cc
2.3% 709Ki 0.7% 50.7Ki third_party/protobuf/src/google/protobuf/wire_format.cc
2.1% 637Ki 1.7% 117Ki third_party/demumble/third_party/libcxxabi/cxa_demangle.cpp
1.8% 549Ki 1.7% 114Ki src/bloaty.cc
1.7% 503Ki 0.7% 48.1Ki third_party/protobuf/src/google/protobuf/repeated_field.cc
1.6% 469Ki 6.2% 427Ki third_party/capstone/arch/X86/X86DisassemblerDecoder.c
1.4% 434Ki 0.2% 15.9Ki third_party/protobuf/src/google/protobuf/message.cc
1.4% 422Ki 0.3% 23.4Ki third_party/re2/re2/dfa.cc
1.3% 407Ki 0.4% 24.9Ki third_party/re2/re2/regexp.cc
1.3% 407Ki 0.4% 29.9Ki third_party/protobuf/src/google/protobuf/map_field.cc
1.3% 397Ki 0.4% 24.8Ki third_party/re2/re2/re2.cc
100.0% 29.5Mi 100.0% 6.69Mi TOTAL
I don't know if it will help but there is a gcc flag to write the assembly code it generates to a text file for your examination.
"-S
Used in place of -c to cause the assembler source file to be generated, using .s as the extension, instead of the object file. This may be useful if you need to examine the generated assembly code. "
I don't know how to map code->generated assembly in general.
For template instantiations you can use something like "strings -a |grep |sort -u|gc++filt" to get a rough picture of what's being created.
The other two items you mentioned seem pretty subjective actually. What is "too much" inlining? Are you worried your binary file is getting inflated? The only thing to do there is actually go into gdb and disassemble the caller to see what it generated, nothing to check for "excessive" inlining in general.
For function size, again I'm curious why it matters. Are you trying to find code that expands unexpectedly when compiled? How do you even define what an expected size is for a tool to examine? Again, you can always disassemble any function that you suspect is compiling to far more code than you want, and see exactly what the compiler is doing.
In Visual C++, this is essentially what .PDB files are for.