I'm looking for a nice Stack Overflow-style answer to the first question in the old blog post C++ Code Size, which I'll repeat below:
I’d really like some tool (ideally, g++ based) that shows me what parts of compiled/linked code are generated from what parts of C++ source code. For instance, to see whether a particular template is being instantiated for hundreds of different types (fixable via a template specialization) or whether code is being inlined excessively, or whether particular functions are larger than expected.
If you're looking to find sources of code bloat in your C++ code, I've used 'nm' for that. The following command will list all the symbols in your app with the biggest code and data chunks at the top:
nm --demangle --print-size --size-sort --reverse-sort <executable_or_lib_name> | less
It does seem like something like this should exist, but I haven't used anything like it. I can tell you how I'd go about scripting this together, though. There are probably swifter and/or sexier ways to do it.
First some stuff that you may already know:
The addr2line command takes in an address and can tell you which source code produced the machine code at that address. The executable needs to be built with debugging symbols, and you'll probably not want to optimize it much (-O0, -O1, or -Os is probably as high as you'd want to go at first anyway). addr2line has several flags, and you'll want to read its manual page, but you will definitely need to use -C or --demangle if you want to see C++ function names that make sense in the output.
The objdump command can print out all kinds of interesting things about the stuff in many types of object files. One of the things it can do is print out a table representing the symbols in or referred to by an object file (including executables).
Now, what you want to do with that:
What you'll want to do is have objdump tell you the address and size of the .text section. This is where actual executable machine code lives. There are several ways to do this, but the easiest (for this, anyway) is probably for you to do:
objdump -h my_exe | grep text
That should result in something like:
12 .text         00000049  0000f000  0000f000  00000400  2**4
If you didn't grep it, it would give you a heading like:
Idx Name Size VMA LMA File off Algn
I think for executables the VMA and LMA should be the same, so it won't matter which you use, but I think LMA is the best. You'll also want the size.
With the LMA and size you can repeatedly call addr2line, asking for the source code origin of the machine code at each address. I'm not sure how this would behave if you passed an address that falls in the middle of an instruction, but I think it should work.
addr2line -e my_exe <address>
The output from this will be a path/filename, a colon, and a line number.
If you count the occurrences of each unique path/file:num, you can then look at the ones with the highest counts.
Perl hashes using the path/file:num as the key and a counter as the value would be an easy way to implement this, though there are faster ways if you find that runs too slow.
You could also filter out things that you can determine don't need to be included early.
For displaying your output you may want to collapse different lines from the same function, though you may notice that different lines within one function have different counts, which could be interesting. Anyway, that could be done either by making addr2line tell you the function name or by using objdump -t in the first step and working one function at a time.
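Putting those pieces together, the counting stage can be sketched in shell. This is a minimal sketch: my_exe, START, SIZE, and the canned addr2line-style lines below are placeholders for illustration, not output from a real build.

```shell
# Sketch of the tallying step.  In real use you would generate the input by
# walking the .text range (START and SIZE taken from `objdump -h my_exe`):
#   for a in $(seq $START $((START + SIZE - 1))); do
#       addr2line -C -e my_exe $(printf '0x%x' "$a")
#   done
# Here, canned addr2line-style output stands in for that loop:
printf '%s\n' \
    'src/vector.h:120' \
    'src/main.cpp:42' \
    'src/vector.h:120' \
    'src/vector.h:120' |
    sort | uniq -c | sort -rn
```

The sort | uniq -c | sort -rn tail is the shell equivalent of the Perl-hash counting mentioned below: the hottest source lines float to the top.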
If you see that some template code or other code lines are showing up in your executables more often than you think they should, then you can easily locate them and have a closer look. Macros and inline functions may end up manifesting themselves differently than you expect.
If you didn't know, objdump and addr2line are from the GNU binutils package, which includes several other useful tools.
I recently wrote a tool, bloat-blame, which does something similar to what nategoose proposed.
Most C toolchains have a way to generate a .map file. This file lists every linked section and symbol along with its address and size. You can use that map file to help you decide which files to look at optimizing first.
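With the GNU toolchain the flag is -Wl,-Map=output.map at link time. As a hedged sketch (the map-style lines below are canned placeholders; a real GNU ld map contains extra surrounding text you would need to filter out first), you can rank per-object .text contributions like this:

```shell
# Rank .text contributions per object file.  Real use would be:
#   g++ main.o widget.o -Wl,-Map=output.map -o my_exe
# followed by feeding the ".text.*" lines of output.map through this loop.
# Canned map-style lines (section, address, size, object) for illustration:
printf '%s\n' \
    '.text.widget 0x0000000000401000 0x2a0 obj/widget.o' \
    '.text.main   0x00000000004012a0 0x80  obj/main.o' |
while read -r sect addr size obj; do
    printf '%8d %s\n' "$((size))" "$obj"   # convert hex size to decimal bytes
done | sort -rn
```

The biggest contributors end up at the top, which is exactly what you want when deciding where to optimize first.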
You can check out bloaty for analyzing the binary size of your program:
https://github.com/google/bloaty
./bloaty bloaty -d compileunits
     FILE SIZE        VM SIZE
 --------------  --------------
  34.8%  10.2Mi  43.4%  2.91Mi    [163 Others]
  17.2%  5.08Mi   4.3%   295Ki    third_party/protobuf/src/google/protobuf/descriptor.cc
   7.3%  2.14Mi   2.6%   179Ki    third_party/protobuf/src/google/protobuf/descriptor.pb.cc
   4.6%  1.36Mi   1.1%  78.4Ki    third_party/protobuf/src/google/protobuf/text_format.cc
   3.7%  1.10Mi   4.5%   311Ki    third_party/capstone/arch/ARM/ARMDisassembler.c
   1.3%   399Ki  15.9%  1.07Mi    third_party/capstone/arch/M68K/M68KDisassembler.c
   3.2%   980Ki   1.1%  75.3Ki    third_party/protobuf/src/google/protobuf/generated_message_reflection.cc
   3.2%   965Ki   0.6%  40.7Ki    third_party/protobuf/src/google/protobuf/descriptor_database.cc
   2.8%   854Ki  12.0%   819Ki    third_party/capstone/arch/X86/X86Mapping.c
   2.8%   846Ki   1.0%  66.4Ki    third_party/protobuf/src/google/protobuf/extension_set.cc
   2.7%   800Ki   0.6%  41.2Ki    third_party/protobuf/src/google/protobuf/generated_message_util.cc
   2.3%   709Ki   0.7%  50.7Ki    third_party/protobuf/src/google/protobuf/wire_format.cc
   2.1%   637Ki   1.7%   117Ki    third_party/demumble/third_party/libcxxabi/cxa_demangle.cpp
   1.8%   549Ki   1.7%   114Ki    src/bloaty.cc
   1.7%   503Ki   0.7%  48.1Ki    third_party/protobuf/src/google/protobuf/repeated_field.cc
   1.6%   469Ki   6.2%   427Ki    third_party/capstone/arch/X86/X86DisassemblerDecoder.c
   1.4%   434Ki   0.2%  15.9Ki    third_party/protobuf/src/google/protobuf/message.cc
   1.4%   422Ki   0.3%  23.4Ki    third_party/re2/re2/dfa.cc
   1.3%   407Ki   0.4%  24.9Ki    third_party/re2/re2/regexp.cc
   1.3%   407Ki   0.4%  29.9Ki    third_party/protobuf/src/google/protobuf/map_field.cc
   1.3%   397Ki   0.4%  24.8Ki    third_party/re2/re2/re2.cc
 100.0%  29.5Mi 100.0%  6.69Mi    TOTAL
I don't know if it will help, but there is a gcc flag to write the assembly code it generates to a text file for your examination.
"-S
Used in place of -c to cause the assembler source file to be generated, using .s as the extension, instead of the object file. This may be useful if you need to examine the generated assembly code. "
I don't know how to map code->generated assembly in general.
For template instantiations you can use something like strings -a <binary> | grep <template_name> | sort -u | c++filt to get a rough picture of what's being created.
The other two items you mentioned seem pretty subjective, actually. What is "too much" inlining? Are you worried your binary file is getting inflated? The only thing to do there is actually go into gdb and disassemble the caller to see what it generated; there's nothing to check for "excessive" inlining in general.
For function size, again I'm curious why it matters. Are you trying to find code that expands unexpectedly when compiled? How do you even define what an expected size is for a tool to examine? Again, you can always disassemble any function that you suspect is compiling to far more code than you want, and see exactly what the compiler is doing.
In Visual C++, this is essentially what .PDB files are for.
Related
The line profiling output of google-pprof claims that most of the running time of my numerical C++ program is being spent in a function called __nss_database_lookup (see below). Apparently that function is for handling things like the passwd file on UNIX systems. My C++ program should only be doing numerical calculations, allocating memory, and passing a few custom C++ data types around.
What's going on? Is the appearance of that function a mirage, a mere artefact of how google-pprof works? Or is it actually being called and wasting two thirds of my program's running time? If it is being called, what could be calling it? Has something mistakenly called it in one of my C++ classes? How would I track that down?
I'm using Ubuntu 20.04, g++-7 and g++-9.
Total: 1046 samples
665 63.6% 63.6% 665 63.6% __nss_database_lookup ??:0
107 10.2% 73.8% 193 18.5% <function1> file.h:1035
92 8.8% 82.6% 92 8.8% <function2> file.h:...
87 8.3% 90.9% 87 8.3% <function3> file.h:995
17 1.6% 92.5% 734 70.2% <function4> file.h:1128
...
(Function and file names obscured for confidentiality reasons)
A friend of mine ran into a similar issue today. It has been a while since you asked, but I would still like to answer so that anyone else who lands here can get some hints.
This is because some local symbols (which correspond to static functions in C/C++) are being called; these symbols don't have entries in the dynamic symbol table, and their text (code) is placed after __nss_database_lookup. So your profiling tool treats them as part of __nss_database_lookup.
For example, your program may call memcpy, and memcpy calls __memmove_avx_unaligned_erms, which is a local symbol in glibc and isn't exported in the dynamic symbol table; its code happens to be placed after __nss_database_lookup together with other local symbols. Your profiling tool can find nothing about __memmove_avx_unaligned_erms, so it simply reports __nss_database_lookup.
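You can demonstrate the mechanism yourself with a tiny shared library. This is a sketch under assumptions (gcc and binutils installed; lib.c and libdemo.so are placeholder names invented for the example):

```shell
# A static (local) function appears in the regular symbol table but not in
# the dynamic one, so a profiler reading only dynamic symbols will credit
# its samples to whatever exported symbol happens to precede it in memory.
printf 'static int hidden(int x) { return x + 1; }\nint visible(int x) { return hidden(x); }\n' > lib.c
gcc -shared -fPIC -O0 lib.c -o libdemo.so
nm libdemo.so | grep ' hidden'          # present, marked 't' (local text)
if nm -D libdemo.so | grep -q ' hidden'; then
    echo 'hidden is exported'
else
    echo 'hidden is local only'         # absent from the dynamic symbol table
fi
```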
A potential solution is to install the libc debug info package (libc6-dbg on Debian/Ubuntu; the name varies across distros). If your profiling tool is smart enough to load the debug info automatically, it may then annotate the symbols correctly. (My friend confirmed that this helped with perf.)
I'm trying to calculate maximum stack usage of an embedded program using static analysis.
I've used the compiler flag -fstack-usage to get the maximum stack usage for each function and the flag -fdump-rtl-expand to generate a graph of all function calls.
The last missing ingredient is stack usage of built-in functions. (at the moment it's only memset)
I guess I could measure it some other way and put a constant into my script. However, I don't want a situation where the implementation of the built-in function changes in a new version of GCC and the value in my script stays the same.
Maybe there is some way to compile built-in functions with the flag -fstack-usage? Or some other way to measure their stack usage via static analysis?
Edit:
This question is not a duplicate of Stack Size Estimation. The other question is about estimating the stack usage of an entire program, while I asked how to estimate it for a single built-in library function. The other question doesn't even mention built-in library functions, and neither do any of its answers.
Approach 1 (dynamic analysis)
You could determine stack size at runtime by filling the stack with a predefined pattern, executing memset, and then checking how many bytes have been modified. This is slower and more involved, as you need to compile a sample program, upload it to the target (unless you have a simulator), and collect the results. You'll also need to be careful about the test data that you supply to the function, as the execution path may change depending on size, data alignment, etc.
For a real-world example of this approach check Abseil's code.
Approach 2 (static analysis)
In general static analysis of binary code is tricky (even disassembling it isn't trivial) and you'd need sophisticated symbolic execution machinery to deal with it (e.g. miasm). But in most cases you can safely rely on detecting patterns which your compiler uses to allocate frames. E.g. for x86_64 GCC you could do something like:
objdump -d /lib64/libc.so.6 | sed -ne '/<__memset_x86_64>:/,/^$/p' > memset.d
NUM_PUSHES=$(grep -c pushq memset.d)
LOCALS=$(sed -ne '/sub .*%rsp/{ s/.*sub \+\$\([^,]\+\),%rsp.*/\1/; p }' memset.d)
LOCALS=$(printf '%d' $LOCALS) # Unhex
echo $(( LOCALS + 8 * NUM_PUSHES ))
Note that this simple approach produces a conservative estimate (getting a more precise result is doable but would require a path-sensitive analysis, which means proper parsing, building a control-flow graph, etc.) and does not handle nested function calls (these can be added easily enough, but should probably be done in a language more expressive than shell).
AVR assembly is in general more complicated because you can't easily detect allocation of space for local variables (modification of stack pointer is split across several in, out and adiw instructions so would require non-trivial parsing in e.g. Python). Simple functions like memset or memcpy don't use local variables so you can still get away with simple greps:
NUM_PUSHES=$(grep -c 'push ' memset.d)
NUM_RCALLS=$(grep -c 'rcall \+\.+0' memset.d)
# A safety check for functions which we can't handle
if grep -qi 'out \+0x3[de]' memset.d; then
echo >&2 'Unable to parse stack modification'
exit 1
fi
echo $((NUM_PUSHES + 2 * NUM_RCALLS))
This is not a great answer but it still may be useful.
Many built-in functions are very simple. For example, memset can be implemented as a simple loop. From my observation it appears that the compiler avoids using the stack if it can get by with registers (which makes perfect sense). Only very long functions need more stack. All that the shorter ones need is the return address for the ret instruction.
It is relatively safe to assume that simple built-in functions don't use the stack at all aside from the call and ret instructions, so the amount of memory used is equal to the size of a function pointer (2 bytes in my case).
Keep in mind that embedded systems don't always have a von Neumann architecture; they often store instructions and data in separate memories. The sizes of function pointers and data pointers may differ.
I am a game developer, and I often find myself writing specialized, templated containers for my needs.
After watching one of my (now) favorite cppcon talks, I am interested in finding the percentage of my final binary that is taken up by a given class/function template in order to determine if some hoisting would be beneficial.
Here, Nicolas mentions that programmers at Ubisoft used their in-house .obj analyzer to determine that their Array class was taking ~80% of their total binary size before they got it down to ~15% (in debug targets) by hoisting.
I want to know how to write such a tool. Specifically, I am looking for a tool that can tell me:
the percentage of total binary size taken by all member and non-member functions of a class given its unmangled name and
the percentage of total binary size taken by an individual function(member or non-member) given its unmangled name.
I want to know how to do this in both windows and linux environments for code compiled with the gcc, clang, and Microsoft compilers.
For example, I would like something along the lines of:
./<toolname> <compiler specific options and/or whatever> <class/function name> \
<list of object files>
If the size of a function was asked for:
Summary for function <name>:
Total Size: <size in bytes> (<percentage>)
If the size of a class was asked for:
Summary for class <name>:
Total Size: <size in bytes> (<percentage>)
member function <name>: <size> (<percentage>)
...
non-member function <name>: <size> (<percentage>)
...
I am confident that I could do this myself with a slightly non-trivial program, but I thought it would be a good idea to ask here first in case there is some combination of built-in or freely available CLI tools that can be used to get this kind of information.
If your solution contains a combination of well documented tools, I don't need an explanation. However, if you write a custom tool, I would like a broad overview of your approach.
You can compile the program with debug information, so the symbol table with each symbol position will be present in the executable headers. Use readelf or objdump, for example, to get this information.
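As a concrete sketch of the symbol-table approach: sum the sizes of all symbols whose demangled names mention the class, then compare against the total. Everything here is a placeholder for illustration (MyArray, my_exe, and the canned nm-style lines); in real use you would pipe in the output of nm --demangle --print-size --defined-only my_exe:

```shell
# Total the code bytes of every symbol belonging to a given class.
# nm's output format is: address size type demangled-name
total=0
while read -r addr size type name; do
    case $name in
        *MyArray*) total=$((total + 0x$size)) ;;   # hex size -> running total
    esac
done <<'EOF'
0000000000401000 0000000000000100 T MyArray<int>::push_back(int)
0000000000401100 0000000000000080 T MyArray<float>::push_back(float)
0000000000401180 0000000000000040 T main
EOF
echo "MyArray code bytes: $total"
```

Dividing that total by the size of the .text section (from objdump -h) gives the percentage you're after. The same idea works per individual function by tightening the pattern.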
I have an Arduino Uno R3. I'm making logical objects for each of my sensors using C++. The Arduino has very limited on-board memory (32 KB*), and, on average, my compiled objects are coming out at around 6 KB*.
I am already using the smallest possible data types required, in an attempt to minimize my memory footprint. Is there a compiler flag to minimize the size of the binary, or do I need to use shorter variable and function names, less functions, etc. to minimize my code base?
Also, any other tips or words of advice for minimizing binary size would be appreciated.
*It may not be measured in KB (as I don't have it sitting in front of me), but 1 object is approximately 1/5 of my total memory size, which is prompting my concern.
There are lots of techniques to reduce binary size in addition to what us2012 and others mentioned in the comments, summing them up with some points of my own:
Use -Os to make gcc/g++ optimize for size.
Use -ffunction-sections -fdata-sections to separate each function or data into distinct sections within the translation unit. Combine it with the linker option -Wl,--gc-sections to get rid of any unreferenced sections.
Run strip with at least the following options: -s -R .comment -R .gnu.version. It can be combined with --strip-unneeded to remove all symbols that are not necessary for relocation processing.
If your code does not use C++ exception handling, you can save a lot of space (up to 30k after all the optimization steps mentioned by Tuxdude).
Therefore you have to provide the following flag:
-fno-exceptions
But even if you don't use exceptions, the exception handling can be included!
Check the following steps:
no usage of new or delete. If you really need them, replace them with malloc/free wrappers. For an example, search for "tinynew.cpp"
provide a function for pure virtual calls, e.g. extern "C" void __cxa_pure_virtual() { while (1); }
override __gnu_cxx::__verbose_terminate_handler(). It handles unhandled exceptions and does name demangling, which is quite large! (e.g. d_print_comp.part.10 at 9.5k or d_type at 1.8k)
Cheers
Flo
I'm writing a very simple process loader for Linux. The executables I'm loading are already compiled, and I know where each one expects to be found in memory. The first approach I tried was using mmap() to manually place each code or data section at the correct location, like
mmap(addr, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0)
which segfaults unless I remove the MAP_FIXED flag because, it seems, the address of one block conflicts with something already in memory, possibly even the loader itself; the address 0x401000 seems to be the problematic one.
I'm not really even sure where to begin with this one. A friend suggested virtualizing memory access operations; I'm not sure what kind of performance hits I'd take for that, and I have no clue how it's done, but it might be an option. What I'd really love to do is create an "empty" process, which would have, as far as it was concerned, full run of the memory, so nothing would be loaded into the user space until I wanted it to be. The whole concept of an "empty" process might be meaningless, but it's the best way to describe what I want. I'm pretty desperate for some references or examples that might help me.
With your process running (maybe snoozing in "sleep(1000);"), look at its /proc/pid/maps. That will tell you what 0x401000 is used for.
~$ sleep 1h &
[3] 2033
~$ cat /proc/2033/maps
00110000-002af000 r-xp 00000000 08:01 1313056 /lib/i386-linux-gnu/libc-2.15.so
...
Here on my box, /bin/sleep doesn't use that block, and neither does my little one-liner program.
You're probably linking in some library which wants to land there?
So one way would be to allocate the block you need way early (long before main() runs -- look elsewhere for that info).
Another way is to link your code to some address you "know" isn't taken (presumably, you're generating the x86 opcodes yourself, or otherwise "linking", so that shouldn't be a stretch).
Another, better, option is to make your code relocatable. The fact that you don't want to replace the entire process's address space (precisely what exec does) more or less says that your code should be just that.
So find a usable address, load the bits there, and, as needed, perform the relocations (so your on-disk file format, if it's not ELF, will need to include reloc info). That's the high road, and the obvious thing you'll want next from your loader.
Of course, that pretty much means reimplementing dlopen() yourself. I assume you're just trying to learn how it works -- if not, man dlopen. Stephane's Rule Zero: it's already there ;-)
Don't forget to support linking other libraries from your code (without duplication), dlclose(), initializers, the various RTLD_* modes, honor MYCUSTOMLD_LIBRARY_PATH, GCC's __thread specifier, etc. ;-)