g++ compiler flag to minimize binary size - c++

I'm have an Arduino Uno R3. I'm making logical objects for each of my sensors using C++. The Arduino has very limited on-board memory 32KB*, and, on average, my compiled objects are coming out around 6KB*.
I am already using the smallest possible data types required, in an attempt to minimize my memory footprint. Is there a compiler flag to minimize the size of the binary, or do I need to use shorter variable and function names, less functions, etc. to minimize my code base?
Also, any other tips or words of advice for minimizing binary size would be appreciated.
*It may not be measured in KB (as I don't have it sitting in front of me), but 1 object is approximately 1/5 of my total memory size, which is prompting my concern.

There are lots of techniques to reduce binary size in addition to what us2012 and others mentioned in the comments, summing them up with some points of my own:
Use -Os to make gcc/g++ optimize for size.
Use -ffunction-sections -fdata-sections to separate each function or data into distinct sections within the translation unit. Combine it with the linker option -Wl,--gc-sections to get rid of any unreferenced sections.
Run strip with at least the following options: -s -R .comment -R .gnu.version. It can be combined with --strip-unneeded to remove all symbols that are not necessary for relocation processing.

If your code does not contain c++-exception-handling you can save a lot of space (up to 30k after all optimize steps mentioned by Tuxdude).
Therefore you have to provide the following flag:
-fno-exceptions
But even if you don't use exceptions, the exception handling can be included!
Check the following steps:
no usage of new, delete. If you really need it replace them by malloc/free wrappers. For an example search for "tinynew.cpp"
provide function for pure virtual call, e.g.extern "C" void __cxa_pure_virtual() { while(1); }
overwrite __gnu_cxx::__verbose_terminate_handler(). It handles unhandled exceptions and does name demangling, which is quite large! (e.g d_print_comp.part.10 with 9.5k or d_type with 1.8k)
Cheers
Flo

Related

How does reordering numerical code in order to avoid temporary variables make the code faster?

I made the experience (this is not the question but a statement), that avoiding non-constant local variables in favor of const variables or avoiding local variables at all, enables the c++ compiler to generate faster code.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
Any other explanation? e.g. Compiler giving up on certain optimization levels, as soon as the code gets too complex in order to avoid astronomical compile times?
No, assignments don't force the compiler to insert a sync point. If the variables are local, and don't affect anything visible outside your function, compiler will remove all unneeded variables, as part of the usual "register allocation" optimization it does.
If your code is so complex it approaches the limit of what the compiler can keep in memory, additional local variables can make the compiler give up and produce unoptimized code. However, this is a very rare edge-case; and it can be triggered on any change in code, not only regarding local variables.
Generally, compiler optimization is hard to reason about, outside of well-known problems (aliasing, loop-carried dependencies, etc). You might feel like you found some related consideration, but it could disappear when you upgrade your compiler or switch to a different one.
Assignments to local variables that you don't subsequently modify allow the compiler to assume that that value in that variable won't change. It might therefore decide (for example) to store it in a register for the 'usage-span' of the variable. This is a simple optimisation, and no self-respecting compiler is going to miss it (unless perhaps register pressure means it is forced to spill).
An example of where this might speed up the code (and maybe reduce code size a little also) is to assign a member variable to a local and then subsequently use that instead of the member variable. If you are confident that the value is not going to change, this might help the compiler generate better code. But then again, it might be a good way of introducing bugs, you do have to be careful playing games like this.
As Thomas Matthews said in the comments, another advantage of doing what you might consider to be a redundant assignment is to help with debugging. It allows the variable to be inspected (and perhaps adjusted) during a debugging run and that can be really handy. I'm not proud, I make mistakes, so I do it a lot.
Just my $0.02
It's unusual that temp vars hurt optimization; usually they're optimized away, or they help the compiler do a load or calculation once instead of repeating it (common subexpression elimination).
Repeated access to arr[i] might actually load multiple times if the compiler can't prove that no other assignments to other pointers to the same type couldn't have modified that array element. float *__restrict arr can help the compiler figure it out, or float ai = arr[i]; can tell the compiler to read it once and keep using the same value, regardless of other stores.
Of course, if optimization is disabled, more statements are typically slower than using fewer large expressions, and store/reload latency bottlenecks are usually the main bottleneck. See How to optimize these loops (with compiler optimization disabled)? . But -O0 (no optimization) is supposed to be slow. If you're compiling without at least -O2, preferably -O3 -march=native -ffast-math -flto, that's your problem.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
"Sync point" isn't the right technical term for it, but ISO C++ rules for FP math do distinguish between optimization within one expression vs. across statements / expressions.
Contraction of a * b + c into fma(a,b,c) is only allowed within one expression, if at all.
GCC defaults to -ffp-contract=fast, allowing it across expressions. clang defaults to strict or no, but supports -ffp-contract=fast. See How to use Fused Multiply-Add (FMA) instructions with SSE/AVX . If fast makes the code with temp vars run as fast as without, strict FP-contraction rules were the reason why it was slower with temp vars.
(Legacy x87 80-bit FP math, or other unusual machines with FLT_EVAL_METHOD!=0 - FP math happens at higher precision, and rounding to float or double costs extra). Strict ISO C++ semantics require rounding at expression boundaries, e.g. on assignments. GCC defaults to ignoring that, -fno-float-store. But -std=c++11 or whatever (instead of -std=gnu++11) will enforce that extra rounding work (a store/reload which costs throughput and latency).
This isn't a problem for x86 with SSE2 for scalar math; computation happens at either float or double according to the type of the data, with instructions like mulsd (scalar double) or mulss (scalar single). So it implements FLT_EVAL_METHOD == 0 instead of x87's 2. Hopefully nobody in 2023 is building number crunching code for 32-bit x87 and caring about the performance, especially without mentioning that obscure build choice. I mention this mostly for completeness.

How to estimate usage of library functions

I'm trying to calculate maximum stack usage of an embedded program using static analysis.
I've used the compiler flag -fstack-usage to get the maximum stack usage for each function and the flag -fdump-rtl-expand to generate a graph of all function calls.
The last missing ingredient is stack usage of built-in functions. (at the moment it's only memset)
I guess I could measure it some other way and put a constant into my script. However, I don't want a situation where the implementation of the built-in function changes in a new version of GCC and the value in my script stays the same.
Maybe there is some way to compile built-in functions with the flag -fstack-usage? Or some other way to measure their stack usage via static analysis?
Edit:
This question is not a duplicate of Stack Size Estimation. The other question is about estimating stack usage of an entire program while I asked how to estimate it for a single built-in library function. The other question doesn't even mention built-in library functions nor any of the answers for it does.
Approach 1 (dynamic analysis)
You could determine stack size at runtime by filling stack with a predefined pattern, executing memset and then checking how many bytes have been modified. This is slower and more involved as you need to compile a sample program, upload it to target (unless you have a simulator) and collect results. You'll also need to be careful about test data that you supply to the function as execution path may change depending on size, data alignment, etc.
For a real-world example of this approach check Abseil's code.
Approach 2 (static analysis)
In general static analysis of binary code is tricky (even disassembling it isn't trivial) and you'd need sophisticated symbolic execution machinery to deal with it (e.g. miasm). But in most cases you can safely rely on detecting patterns which your compiler uses to allocate frames. E.g. for x86_64 GCC you could do something like:
objdump -d /lib64/libc.so.6 | sed -ne '/<__memset_x86_64>:/,/^$/p' > memset.d
NUM_PUSHES=$(grep -c pushq memset.d)
LOCALS=$(sed -ne '/sub .*%rsp/{ s/.*sub \+\$\([^,]\+\),%rsp.*/\1/; p }' memset.d)
LOCALS=$(printf '%d' $LOCALS) # Unhex
echo $(( LOCALS + 8 * NUM_PUSHES ))
Note that this simple approach produces a conservative estimate (getting more precise result is doable but would require a path-sensitive analysis which requires proper parsing, building control-flow graph, etc.) and does not handle nested function calls (can be easily added but should probly be done in a language more expressive than shell).
AVR assembly is in general more complicated because you can't easily detect allocation of space for local variables (modification of stack pointer is split across several in, out and adiw instructions so would require non-trivial parsing in e.g. Python). Simple functions like memset or memcpy don't use local variables so you can still get away with simple greps:
NUM_PUSHES=$(grep -c 'push ' memset.d)
NUM_RCALLS=$(grep -c 'rcall \+\.+0' memset.d)
# A safety check for functions which we can't handle
if grep -qi 'out \+0x3[de]' memset.d; then
echo >&2 'Unable to parse stack modification'
exit 1
fi
echo $((NUM_PUSHES + 2 * NUM_RCALLS))
This is not a great answer but it still may be useful.
Many of built-in functions are very simple. For example memset can be implemented just as a simple loop. From my observation it appears that compiler avoid using stack if it can just use registers (which makes perfect sense). Only very long function need more stack. All that shorter ones need is the return address for ret instruction.
It is relatively safe to assume that simple built-in functions don't use stack at all aside from instructions call and ret, so the amount of memory is equal to size of pointer to a function. (2 bytes in my case)
Keep in mind that embedded systems don't always have Von Neumann architecture and they often store instructions and data in separate memories. Size of pointers to function and data may be different.

How to make C++ linking take less memory

I am working on a university research project in C++ with a lot of templates which have further nested templates and so on. The project is about efficient index data structures for a specific area of research. You can imagine: An index structure has a lot of parameters to adjust, so we use the template parameters excessively. Of course, we want to test our indexes with different sets of parameters, so there are quite a lot of template instantiations.
The project is not that huge. Maybe 50k LOC. But still, linking take 50 seconds and consumes over 7 GB memory (!!!). I am on a 32GB workstation so everything is fine for me. I often have bachelor and master students working on this project. The problem is that they often work on Laptops with 4 or 8 GB of RAM. Thus, these students have great troubles compiling that project. The resulting test binary (i.e. a binary that simply contains unit tests for the index structures) is 700 megabytes.
Most of that are symbols, because the nested templates produce huge names. If I use strip on the binary, it drops down to 8 megabytes.
So is there a way to reduce the RAM usage during linkage? And is there a way to have smaller symbols even with nested templates?
We compile using g++4.9 with std=c++11 under Ubuntu 14.10.
Edit:
It really seems to be the nested templates. We have two test cases with really deeply nested templates. The two .o files for these test make up almost 90% of the memory of the final binary. They result in method names that are over 3000 characters long. There is no way to not use the nested templates here as they form a "processing tree" of an example query. Is there any way to keep names short when using deeply nested templates?
GCC has a garbage collection scheme for the RAM it uses.
The parameters ggc-min-expand and ggc-min-heapsize are used to determine when GCC should clean up and dealloc its unused memory (their defaults are percentages of the total system memory).
You could try something like:
g++ --param ggc-min-expand=0 --param ggc-min-heapsize=8192
From the GCC manual:
ggc-min-expand
GCC uses a garbage collector to manage its own memory allocation. This parameter specifies the minimum percentage by which the garbage
collector's heap should be allowed to expand between collections.
Tuning this may improve compilation speed; it has no effect on code
generation.
The default is 30% + 70% * (RAM/1GB) with an upper bound of 100% when RAM >= 1GB. If getrlimit is available, the notion of "RAM" is the
smallest of actual RAM, RLIMIT_RSS, RLIMIT_DATA and RLIMIT_AS. If GCC
is not able to calculate RAM on a particular platform, the lower bound
of 30% is used. Setting this parameter and ggc-min-heapsize to zero
causes a full collection to occur at every opportunity. This is
extremely slow, but can be useful for debugging.
ggc-min-heapsize
Minimum size of the garbage collector's heap before it begins
bothering to collect garbage. The first collection occurs after the
heap expands by ggc-min-expand% beyond ggc-min-heapsize. Again, tuning
this may improve compilation speed, and has no effect on code
generation.
The default is RAM/8, with a lower bound of 4096 (four megabytes) and an upper bound of 131072 (128 megabytes). If getrlimit is
available, the notion of "RAM" is the smallest of actual RAM,
RLIMIT_RSS, RLIMIT_DATA and RLIMIT_AS. If GCC is not able to calculate
RAM on a particular platform, the lower bound is used. Setting this
parameter very large effectively disables garbage collection. Setting
this parameter and ggc-min-expand to zero causes a full collection to
occur at every opportunity.
Further details:
Compiling with GCC on low memory VPS
So is there a way to reduce the RAM usage during linkage? And is there a way to have smaller symbols even with nested templates?
Have you considered using the pimpl idiom in client code?
Consider a situation where you have this include chain:
A.h -> B.h -> C.h -> D.h (C includes D, B includes C, etc)
Suppose that A.h defines class AA, B.h defines class B and so on (with AA implemented in terms of BB, BB implemented in terms of CC and so on).
If DD is a large template and used in the implementation of CC, the templated code will be compiled three times, for the compilation units A, B and C.
Now, consider what happens if instead of C.h including D.h, you have the following situation:
C.h foward declares a CCImpl *pimpl and forwards all it's methods to pImpl-> methods (and does not include D.h).
C.cpp includes C.h and D.h, and implements CCImpl and CC.
Now, D will be included once (and compiled once, for C.cpp). A and B will only include C.h, with a CImpl forward declaration. A.h, B.h and C.h no longer know that a template exists.
Use inheritance judiciously. You may have a class Foo<1,4,8,1,9,int, std::string> as a base class for class Bar, and then the object file will mention just Bar.
Note that typedef does not introduce names for linking purposes.
[edit]
And to address a performance concern from another comment, an empty derived class adds no overhead on common compilers at normal optimization levels (and often there's no overhead even in debug builds)

How much is 32 kB of compiled code

I am planning to use an Arduino programmable board. Those have quite limited flash memories ranging between 16 and 128 kB to store compiled C or C++ code.
Are there ways to estimate how much (standard) code it will represent ?
I suppose this is very vague, but I'm only looking for an order of magnitude.
The output of the size command is a good starting place, but does not give you all of the information you need.
$ avr-size program.elf
text data bss dec hex filename
The size of your image is usually a little bit more than the sum of the text and the data sections. The bss section is essentially compressed because it is all 0s. There may be other sections which are relevant which aren't listed by size.
If your build system is set up like ones that I've used before for AVR microcontrollers then you will end up with an *.elf file as well as a *.bin file, and possibly a *.hex file. The *.bin file is the actual image that would be stored in the program flash of the processor, so you can examine its size to determine how your program is growing as you make edits to it. The *.bin file is extracted from the *.elf file with the objdump command and some flags which I can't remember right now.
If you are wanting to know how to guess-timate how your much your C or C++ code will produce when compiled, this is a lot more difficult. I have observed a 10x blowup in a function when I tried to use a uint64_t rather than a uint32_t when all I was doing was incrementing it (this was about 5 times more code than I thought it would be). This was mostly to do with gcc's avr optimizations not being the best, but smaller changes in code size can creep in from seemingly innocent code.
This will likely be amplified with the use of C++, which tends to hide more things that turn into code than C does. Chief among the things C++ hides are destructor calls and lots of pointer dereferencing which has to do with the this pointer in objects as well as a secret pointer many objects have to their virtual function table and class static variables.
On AVR all of this pointer stuff is likely to really add up because pointers are twice as big as registers and take multiple instructions to load. Also AVR has only a few register pairs that can be used as pointers, which results in lots of moving things into and out of those registers.
Some tips for small programs on AVR:
Use uint8_t and int8_t instead of int whenever you can. You could also use uint_fast8_t and int_fast8_t if you want your code to be portable. This can lead to many operations taking up only half as much code, because int is two bytes.
Be very aware of things like string and struct constants and literals and how/where they are stored.
If you're not scared of it, read the AVR assembly manual. You can get an idea of the types of instructions, and from that the type of C code that easily maps to those instructions. Use that kind of C code.
You can't really say there. The length of the uncompiled code has little to do with the length of the compiled code. For example:
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
int main()
{
std::vector<std::string> strings;
strings.push_back("Hello");
strings.push_back("World");
std::sort(strings.begin(), strings.end());
std::copy(strings.begin(), strings.end(), std::ostream_iterator<std::string>(std::cout, ""));
}
vs
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
int main()
{
std::vector<std::string> strings;
strings.push_back("Hello");
strings.push_back("World");
for ( int idx = 0; idx < strings.size(); idx++ )
std::cout << strings[idx];
}
Both are the exact same number of lines, and produce the same output, but the first example involves an instantiation of std::sort, which is probably an order of magnitude more code than the rest of the code here.
If you absolutely need to count number of bytes used in the program, use assembler.
Download the arduino IDE and 'verify' some of your existing code, or look at the sample sketches. It will tell you how many bytes that code is, which will give you an idea of how much more you can fit into a given device. Picking a couple of the examples at random, the web server example is 5816 bytes, and the LCD hello world is 2616. Both use external libraries.
Try creating a simplified version of your app, focusing on the most valuable feature first, then start adding up the 'nice (and cool) stuff to have'. Keep an eye on the byte usage shown in the Arduino IDE when you verify your code.
As a rough indication, my first app (LED flasher controlled by a push buttun) requires 1092 bytes. That`s roughly 1K out of 32k. Pretty small footprint for C++ code!
What worries me most is the limited amount of RAM (1 Kb). If the CPU stack takes some of it, then there isn`t much left for creating any data structures.
I only had my Arduino for 48 hrs, so there is still a lot to use it effectively ;-) But it's a lot of fun to use :).
It's quite a bit for a reasonably complex piece of software, but you will start bumping into the limit if you want it to have a lot of different functionality. Also, if you want to store quite a lot of static strings and data, it can eat into that quite quickly. But 32 KB is a decent amount for embedded applications. It tends to be RAM that you have problems with first!
Also, quite often the C++ compilers for embedded systems are a lot worse than the C compilers.
That is, they are nowhere as good as C++ compilers for the common desktop OS's (in terms of producing efficient machine code for the target platform).
At a linux system you can do some experiments with static compiled example programs. E.g.
$ size `which busybox `
text data bss dec hex filename
1830468 4448 25650 1860566 1c63d6 /bin/busybox
The sizes are given in bytes. This output is independent from the executable file format, since the sizes of the different sections inside the file format. The text section contains the machine code and const stufff. The data section contains data for static initialization of variables. The bss size is the size of uninitialized data - of course uninitialized data does not need to be stored in the executable file.)
Well, busybox contains a lot of functionality (like all common shell commands, a shell etc.).
If you link own examples with gcc -static, keep in mind, that your used libc may dramatically increase the program size and that using an embedded libc may be much more space efficient.
To test that you can check out the diet-libc or uclibc and link against that. Actually, busybox is usually linked against uclibc.
Note that the sizes you get this way give you only an order of magnitude. For example, your workstation probably uses another CPU architecture than the arduino board and the machine code of different architecture may differ, more or less, in its size (because of operand sizes, available instructions, opcode encoding and so one).
To go on with rough order of magnitude reasoning, busybox contains roughly 309 tools (including ftp daemon and such stuff), i.e. the average code size of a busybox tool is roughly 5k.

What makes EXE's grow in size?

My executable was 364KB in size. It did not use a Vector2D class so I implemented one with overloaded operators.
I changed most of my code from
point.x = point2.x;
point.y = point2.y;
to
point = point2;
This resulted in removing nearly 1/3 of my lines of code and yet my exe is still 364KB. What exactly causes it to grow in size?
The compiler probably optimised your operator overload by inlining it. So it effectively compiles to the same code as your original example would. So you may have cut down a lot of lines of code by overloading the assignment operator, but when the compiler inlines, it takes the contents of your assignment operator and sticks it inline at the calling point.
Inlining is one of the ways an executable can grow in size. It's not the only way, as you can see in other answers.
What makes EXE’s grow in size?
External libraries, especially static libraries and debugging information, total size of your code, runtime library. More code, more libraries == larger exe.
To reduce size of exe, you need to process exe with gnu strip utility, get rid of all static libraries, get rid of C/C++ runtime libraries, disable all runtime checks and turn on compiler size optimizations. Working without CRT is a pain, but it is possible. Also there is a wcrt (alternative C runtime) library created for making small applications (by the way, it hasn't been updated/maintained during last 5 years).
The smallest exe that I was able create with msvc compiler is somewhere around 16 kilobytes. This was a windows application that displayed single window and required msvcrt.dll to run. I've modified it a bit, and turned it into practical joke that wipes out picture on monitor.
For impressive exe size reduction techniques, you may want to look at .kkrieger. It is a 3D first person shooter, 96 kilobytes total. The game has a large and detailed level, supports shaders, real-time shadows, etc. I.e. comparable with Saurbraten (see screenshots). The smallest working windows application (3d demo with music) I ever encountered was 4 kilobytes big, and used compression techniques and (probably) undocumented features (i.e. the fact that *.com executbale could unpack and launch win32 exe on windows xp)..
In most cases, size of *.exe shouldn't really bother you (I haven't seen a diskette for a few years), as long as it is reasonable (below 100 megabytes). For example of "unreasonable" file size see debug build of Qt 4 for mingw.
This resulted in removing nearly 1/3 of my lines of code and yet my exe is still 364KB.
Most likely it is caused by external libraries used by compiler, runtime checks, etc.
Also, this is an assignment operation. If you aren't using custom types for x (with copy constructor), "copy" operation is very likely to result in small number of operations - i.e. removing 1/3 of lines doesn't guarantee that your code will be 1/3 shorter.
If you want to see how much impact your modification made, you could "ask" compiler to produce asm listing for both versions of the program then compare results (manually or with diff). Or you could disasm/compare both versions of executable. BUt I'm certain that using GNU strip or removing extra libraries will have more effect than removing assignment operators.
What type is point? If it's two floats, then the compiler will implicitly do a member-by-member copy, which is the same thing you did before.
EDIT: Apparently some people in today's crowd didn't understand this answer and compensated by downvoting. So let me elaborate:
Lines of code have NO relation to the executable size. The source code tells the compiler what assembly line to create. One line of code can cause hundreds if not thousands of assembly instructions. This is particularly true in C++, where one line can cause implicit object construction, destruction, copying, etc.
In this particular case, I suppose that "point" is a class with two floats, so using the assignment operator will perform a member-by-member copy, i.e. it takes every member individually and copies it. Which is exactly the same thing he did before, except that now it's done implicitly. The resulting assembly (and thus executable size) is the same.
Executables are most often sized in 'pages' rather than discrete bytes.
I think this a good example why one shouldn't worry too much about code being too verbose if you have a good optimizing compiler. Instead always code clearly so that fellow programmers can read your code and leave the optimization to the compiler.
Some links to look into
http://www2.research.att.com/~bs/bs_faq.html#Hello-world
GCC C++ "Hello World" program -> .exe is 500kb big when compiled on Windows. How can I reduce its size?
http://www.catch22.net/tuts/minexe
As for Windows, lots of compiler options in VC++ may be activated like RTTI, exception handling, buffer checking, etc. that may add more behind the scenes to the overall size.
When you compile a c or c++ program into an executable, the compiler translates your code into machine code, and applying optimizations as it sees fit.
But simply, more code = more machine code to generate = more size to the executable.
Also, check if you have lot of static/global objects. This substantially increase your exe size if they are not zero initialized.
For example:
int temp[100] = {0};
int main()
{
}
size of the above program is 9140 bytes on my linux machine.
if I initialize temp array to 5, then the size will shoot up by around 400 bytes. The size of the below program on my linux machine is 9588.
int temp[100] = {5};
int main()
{
}
This is because, zero initialized global objects go into .bss segment, which ill be initialized at once during program startup. Where as non zero initialized objects contents will be embedded in the exe itself.