Code size is doubled when compiling with GCC ARM Embedded? - c++

I've just ported a STM32 microcontroller project from Keil uVision (using Keil ARM Compiler) to CooCox CoIDE (using GCC ARM Embedded compiler).
Problem is, the code size is the double size when compiled in CoIDE with GCC compared to Keil uVision.
How can this be? What can I do?
Code size in Keil: 54632b (.text)
Code size in CoIDE: 100844b (.text)
GCC compiler flags:
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -g2 -Wl,-Map=project.map -Os
-Wl,--gc-sections -Wl,-TC:\arm-gcc-link.ld -g -o project.elf -L -lm
I am suspecting CoIDE and GCC to compile a lot of functions and files, that are present in the project, though aren't used (yet). Is it possible that it compiles whole files even if I only use 1 function out of 20 in there? (even though I have -Os)..

Hard to say which files are really compiled/linked in your final binary from the information you give. I suppose it takes all the C files it finds on your project if you did not explicitly specified which one to compile or if you don't use your own Makefile.
But from the compiler options you give, the linker flag --gc-sections won't do much garbage if you don't have the following compiler flags: -ffunction-sections -fdata-sections. Try to add those options to strip all unused functions and data at link time.

Since the question was tagged with C++, I wonder if you would like to disable exceptions and RTTI. Those take quite a bit of code. Add -fno-exceptions -fno-rtti to linker flags.

Related

Is there a Gfortran flag similar to intel ifort's -ipo-c, that would generate single optimized object from all compiled files?

Intel Fortran compiler/linker has the optional flag -ipo-c or /Qipo-c which enables the generation of a single interprocedurally-optimized object file from all files, which can be later used for linking. Is there an equivalent flag to Intel's -ipo-cin gfortran?
GCC has -fwhole-program, does that work for gfortran?
Or if you don't want to pass all the Fortran source files on one giant command line, there's -flto link-time optimization which uses a linker "plugin" to run the optimizer on GIMPLE stored in .o files (instead of or as well as machine code).
LTO means you should pass all your optimization options to the invocation of gfortran that does the linking, as well as the gfortran -c that compiles to .o.
So you might use gfortran -ffast-math -O3 -march=native -flto to compile and link, assuming gfortran supports the same options as gcc. (And that -march=native is what you want: make an executable optimized for the computer you compiled on, which might SIGILL on other computers without all the ISA extensions this one supports.)

C (Embedded): Code size does not shrink when commenting FreeRTOS' RootTask

I have a tree-shaped process/task architecture in my FreeRTOS construct. The main() simply creates a RootTask (after initializing the HAL), which creates another two tasks, and so on.
I'm currently fighting with flash size (code + constants, basically?), and therefore disabling tasks (=commenting them) in order to show the compiler that most translation units are not even needed, to find out which modules are most expensive.
However, I've "commented my way up" to main() and took out EVERYTHING except the while(1) loop. It still doesn't fit in 128k Flash.
Tried removing all C++ translation units, using even gcc for linking; but still ~100k ".text" section (I would be fine with 10k at this point, considering the application doesn't do anything).
I'm using arm-none-eabi-gcc/g++ 5.4.1. The linker script was generated by ST-CubeMX.
gcc flags:
-mcpu=cortex-m0 -mthumb -Os -s -Wall -Wa -a -ad -alms=build/$(notdir $(<:.c=.lst))
Linker flags:
-mcpu=cortex-m0 -specs=nosys.specs -T$(LDSCRIPT) [some-libraries] -Wl,-Map=$(BUILD_DIR)/$(TARGET).map,--cref -Wl,--gc-sections,--undefined=uxTopUsedPriority,-flto
(also generated by CubeMX, except -flto)
Can someone explain why the Compiler/Linker are not removing unused code from the final binary? Are there tools to investigate further?
Please let me know if more information is needed.
Thank you!
You need to pass the flags -ffunction-sections -fdata-sections to the compiler (gcc/g++) such that --gc-sections in the linker works, and -flto to the compiler such that -flto in the linker can do its work.

GCC, -flto, -fno-builtin and custom function implementation of glibc functions

I'm observing unexpected behaviour (at least I cant find explanation for it) with GCC flag -flto and jemalloc/tcmalloc. Once -flto is used and I link with above libraries malloc/calloc and friends are not replaced by je/tc malloc implementation, the glibc implementation is called. Once I remove -flto flag, everything works as expected. I tried to use -fno-builtin/-fno-builtin-* with -flto but still, it doesnt pick the je/tc malloc implementation.
How the -flto machinery works? Why the binary doesnt pick new implementation? How it even links with -fno-builtin when it should fail on unresolved external for, say, printf?
EDIT001:
GCC 7.3
Sample code
int main()
{
auto p = malloc(1024);
free(p);
return 0;
}
Compilation:
/usr/bin/c++ -O2 -g -DNDEBUG -flto -std=gnu++14 -o
CMakeFiles/flto.dir/main.cpp.o -c
/home/user/Development/CPPJunk/flto/main.cpp
Linkage:
/usr/bin/c++ -O2 -g -DNDEBUG -flto CMakeFiles/flto.dir/main.cpp.o
-o flto -L/home/user/Development/jemalloc -Wl,-rpath,/home/user/Development/jemalloc -ljemalloc
EDIT002:
More suitable sample code
#include <cstdlib>
int main()
{
auto p = malloc(1024);
if (p) {
free(p);
}
auto p1 = new int;
if (p1) {
delete p1;
}
auto p2 = new int[32];
if (p2) {
delete[] p2;
}
return 0;
}
First, your sample code is wrong. Read carefully the C11 standard n1570. When you want to use the standard malloc, you should #include <stdlib.h>.
In C++11 (read n3337) malloc is frowned upon and should not be used (prefer new). If you still want to use std::malloc in C++ you should #include <cstdlib> (which, in GCC, is internally including <stdlib.h>)
Then your sample code is almost C code (once you replace auto with void*), not C++. It could be optimized (once you include <stdlib.h>), even without -flto but with just -O3, according to the as-if rule, to an empty main. (I've even wrote a public report, bismon-chariot-doc.pdf, which has a section §1.4.2 explaining in several pages how that optimization happens).
To optimize around malloc and free, GCC uses some __attribute__(malloc) function attribute in the declaration (inside <stdlib.h>) of malloc.
How the -flto machinery works?
LTO is explained in GCC internals §25.
It works by using some internal (GIMPLE-like and/or SSA-like) representation of the code both at "compile" and at "link" time (actually, the linking step becomes another compilation with whole-program optimization, so your code gets "compiled" twice in practice).
LTO always should (in practice) be used with some optimization flag (e.g. -O2 or even -O3) both at compile and at link time. So you should compile and link with g++ -flto -O2 (it has no practical sense to use -flto without at least -O2 and the exact same optimization flags should be used at compile and at link time).
More precisely -flto also embeds in the object files some internal (GIMPLE-like) representation of the source code, and that is also used "at link time" (notably for optimization and inlining happening again when "linking" your entire program, re-using its GIMPLE). Actually GCC contains some LTO front-end and compiler called lto1 (in addition of the C++ front-end and compiler called cc1plus) and lto1 is (when you link with g++ -flto -O2) used at link time to reprocess these GIMPLE representations.
Probably, libjemalloc has its own headers, and might have inline (or inlinable) functions. Then you also need to use -flto -O2 when compiling that library from its source code (so that its Gimple is stored in the library)
At last, the fact that the usual malloc gets called is independent of -flto. It is a linker issue, not a compiler one. You could try to link -ljemalloc statically (and then you'll better build that library also with gcc -flto -O2; if you don't build it like that you won't get LTO optimizations across malloc calls).
You could pass also -v to your compilation and linking commands to understand what g++ is doing. You could even pass -Wl,--verbose to ask the ld (started by g++) to be verbose.
Notice that LTO (and the internal representations that it is using) is compiler and version specific. The internal (Gimple & SSA) representation is slightly different between GCC 7 & GCC 8 (and in Clang it is very different, so of course incompatible). The dynamic linker ld-linux(8) does not know about LTO.
PS. You could install the libjemalloc-dev package and add #include <jemalloc/jemalloc.h> in your code. See also jemalloc(3) man page. Probably libjemalloc could be configured or patched to define some je_malloc symbol as a replacement for malloc. Then it would be simpler (for LTO) to use je_malloc in your code (to avoid conflict between several malloc ELF symbols). To learn more about symbols in shared libraries, read Drepper's How to Write Shared Libraries paper. And of course you should expect LTO to change the behavior of linking!

g++ arm-none-eabi upgrade from 4.9 to gcc 8.2. Generated binary do not fit any more in flash

I recently updated my Linux laptop from Ubuntu 16.04 to 18.04.
I had a STM32 (Cortex-M4) Makefile based project that compiled correctly with the arm-none-eabi g++ version provided by Ubuntu. The generated file required 47620 bytes in the .text section.
With the Ubuntu upgrade, I have also installed an up-to-date version of gcc (from ARM website). Version is 8.2.1.
When I compile the same project (make clean && make), the generated binary do not fit in flash (97424 bytes required, more than twice!). The project is exactly the same (sources, link script, startup files, Makefile).
The compiler options are: -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -DSTM32F303x8 -DARMCM4 -O0 -g -Wall -fexceptions -Wno-deprecated.
The linker options are -mthumb -mcpu=cortex-m4 -Tstm32f303K8.ld -mfloat-abi=hard -mfpu=fpv4-sp-d16 --specs=nosys.specs -lm -Wl,--start-group -lm -Wl,--end-group -Wl,--gc-sections -Lsys -Xlinker -Map=test.elf.map
When I look at the .Map generated file, all the user functions take approximatively the same size (new version saves 8 bytes!). But after, it includes C++ sepcific parts, and one is more than 26Kb (from map file):
.text 0x00000000080079e8 0x683c /usr/local/gcc-arm-none-eabi-8-2018-q4-major/bin/../lib/gcc/arm-none-eabi/8.2.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libstdc++.a(cp-demangle.o)
0x000000000800e13c __cxa_demangle
Note: there is no problem with C only projects, only with C++. The library included are the same (gcc 4.9.3 -> armv7e-m/fpu, and gcc 8.2.1 -> thumb/v7e-m+fp/hard):
libm.a libstdc++.a libc.a libnosys.a libgcc.a
Is there a way to get rid of that so that I can compile and flash my (no so old) project?
regards,
I found a solution using the libstdc++_nano (instead of implicit libstc++). With that, the code size is reduced from 84kb to 26kb!
LDFLAGS += -lstdc++_nano
It just works. Thanks #Henrik, #Matthieu and #EOF for your support!
It might be related to exception handling, as std::terminate(), which is used with exceptions, might call the demangling routine. If you don't need exceptions then try disabling them with -fno-exceptions as described here.
Another solution might be to look at the GCC headers:
Demangling routine.
ABI-mandated entry point in the C++ runtime library for demangling.
[...]
returns a pointer to the start of the NUL-terminated demangled
name, or NULL if the demangling fails. The caller is
responsible for deallocating this memory using free.
The prototype is:
char*
__cxa_demangle(const char* __mangled_name, char* __output_buffer,
size_t* __length, int* __status);
So you could probably just supply your own dummy function returning NULL (Given that all library functions are weak, and can be overridden). I'll advise you to look at the disassembled code first though, and find out how and why it is being called in the first place, since it might change behaviour to just discard functionality).
They also give other advise in This forum post, which might be useful for you as well:
Optimize for size with -Os instead of -O0 (possibly add the -Og option instead, if you prefer easily debuggable code, it is often both smaller and faster than -O0).
Optimize at link-time with -flto while compiling and linking.
Maybe disable RTTI if not used.

equivalent gcc flags for g++ call

I'm playing around with a toolchain that seems to wrap gcc (qcc), but also uses g++ for a few things. This caused a bit of confusion when I couldn't link libs I built with g++ using g(q)cc even though it was for the same architecture (due to missing lib errors). After a bit more research, I found that g++ is basically gcc with a few default flags and a slightly different interpretation mechanism for file extensions (there may be other differences I've glanced over). I'd like to know exactly which flags can be passed to gcc to amount to the equivalent g++ call. For instance:
g++ -g -c hello.cpp // I know at the very least that this links in stl
gcc -g -c -??? // I want the exact same result as I got with g++... what flags do I use?
The way the tool chain is set up makes it sort of difficult to simply replace the gcc calls with g++. It'd be much easier to know which flags I need to pass.
The differences between using gcc vs. g++ to compile C++ code is that (a) g++ compiles files with the .c, .h, and .i extensions as C++ instead of C, and that (b) it automatically links with the C++ standard library (-lstdc++). See the man page.
So assuming that you're not compiling .c, .h., or .i files as C++, all you need to do to make gcc act like g++ is add the -lstdc++ command line option to your linker flags. If you are compiling those other files as C++, you can add -x c++, but I'd advise you instead to rename them to use .cc or .ii files (.h can stay that way, if you're using precompiled headers).