I'm observing unexpected behaviour (at least I can't find an explanation for it) with the GCC flag -flto and jemalloc/tcmalloc. Once -flto is used and I link with the above libraries, malloc/calloc and friends are not replaced by the je/tc malloc implementation; the glibc implementation is called instead. Once I remove the -flto flag, everything works as expected. I tried -fno-builtin/-fno-builtin-* together with -flto, but it still doesn't pick up the je/tc malloc implementation.
How does the -flto machinery work? Why doesn't the binary pick up the new implementation? And how does it even link with -fno-builtin, when it should fail on an unresolved external for, say, printf?
EDIT001:
GCC 7.3
Sample code
int main()
{
    auto p = malloc(1024);
    free(p);
    return 0;
}
Compilation:
/usr/bin/c++ -O2 -g -DNDEBUG -flto -std=gnu++14 -o CMakeFiles/flto.dir/main.cpp.o -c /home/user/Development/CPPJunk/flto/main.cpp
Linkage:
/usr/bin/c++ -O2 -g -DNDEBUG -flto CMakeFiles/flto.dir/main.cpp.o -o flto -L/home/user/Development/jemalloc -Wl,-rpath,/home/user/Development/jemalloc -ljemalloc
EDIT002:
More suitable sample code
#include <cstdlib>
int main()
{
    auto p = malloc(1024);
    if (p) {
        free(p);
    }
    auto p1 = new int;
    if (p1) {
        delete p1;
    }
    auto p2 = new int[32];
    if (p2) {
        delete[] p2;
    }
    return 0;
}
First, your sample code is wrong. Read carefully the C11 standard n1570. When you want to use the standard malloc, you should #include <stdlib.h>.
In C++11 (read n3337) malloc is frowned upon and should not be used (prefer new). If you still want to use std::malloc in C++ you should #include <cstdlib> (which, in GCC, is internally including <stdlib.h>)
Then your sample code is almost C code (once you replace auto with void*), not C++. Once you include <stdlib.h>, it could be optimized to an empty main under the as-if rule, even without -flto and with just -O3. (I even wrote a public report, bismon-chariot-doc.pdf, whose section §1.4.2 explains over several pages how that optimization happens.)
To optimize around malloc and free, GCC uses the __attribute__((malloc)) function attribute in the declaration of malloc (inside <stdlib.h>).
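As a minimal sketch of what that attribute looks like in user code, the following declares a wrapper with it. The names my_alloc and scratch_roundtrip are hypothetical and not part of any library; the attribute spelling is the GCC/Clang extension syntax.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator wrapper. The attribute tells the optimizer that the
// returned pointer does not alias any other object that is live at the call.
__attribute__((malloc)) void* my_alloc(std::size_t size);

void* my_alloc(std::size_t size)
{
    return std::malloc(size);
}

// The allocated block never escapes this function, so under the as-if rule
// an optimizing compiler is allowed to remove the malloc/free pair entirely.
int scratch_roundtrip()
{
    void* p = std::malloc(1024);
    std::free(p); // free(NULL) is a no-op, so no null check is needed
    return 0;
}
```

Whether the pair in scratch_roundtrip is really removed depends on the compiler version and optimization level; inspecting the generated assembly (for instance with g++ -O2 -S) is the way to check.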
How does the -flto machinery work?
LTO is explained in GCC internals §25.
It works by using some internal (GIMPLE-like and/or SSA-like) representation of the code both at "compile" and at "link" time (actually, the linking step becomes another compilation with whole-program optimization, so your code gets "compiled" twice in practice).
In practice, LTO should always be used with an optimization flag (e.g. -O2 or even -O3) at both compile and link time. So you should compile and link with g++ -flto -O2: using -flto without at least -O2 makes little practical sense, and the exact same optimization flags should be used at compile and at link time.
More precisely, -flto also embeds in the object files some internal (GIMPLE-like) representation of the source code, and that representation is used again "at link time" (notably for the optimization and inlining that happen when "linking" your entire program, re-using its GIMPLE). GCC actually contains an LTO front-end and compiler called lto1 (in addition to the C++ front-end and compiler called cc1plus), and lto1 is used at link time (when you link with g++ -flto -O2) to reprocess these GIMPLE representations.
Probably, libjemalloc has its own headers, and might have inline (or inlinable) functions. Then you also need to use -flto -O2 when compiling that library from its source code (so that its GIMPLE is stored in the library).
Finally, the fact that the usual malloc gets called is independent of -flto. It is a linker issue, not a compiler one. You could try to link -ljemalloc statically (and then you had better build that library with gcc -flto -O2 as well; if you don't build it like that, you won't get LTO optimizations across malloc calls).
You could pass also -v to your compilation and linking commands to understand what g++ is doing. You could even pass -Wl,--verbose to ask the ld (started by g++) to be verbose.
Notice that LTO (and the internal representations that it is using) is compiler and version specific. The internal (Gimple & SSA) representation is slightly different between GCC 7 & GCC 8 (and in Clang it is very different, so of course incompatible). The dynamic linker ld-linux(8) does not know about LTO.
PS. You could install the libjemalloc-dev package and add #include <jemalloc/jemalloc.h> in your code. See also jemalloc(3) man page. Probably libjemalloc could be configured or patched to define some je_malloc symbol as a replacement for malloc. Then it would be simpler (for LTO) to use je_malloc in your code (to avoid conflict between several malloc ELF symbols). To learn more about symbols in shared libraries, read Drepper's How to Write Shared Libraries paper. And of course you should expect LTO to change the behavior of linking!
The Intel Fortran compiler/linker has the optional flag -ipo-c (or /Qipo-c), which enables the generation of a single interprocedurally optimized object file from all files, which can later be used for linking. Is there an equivalent to Intel's -ipo-c in gfortran?
GCC has -fwhole-program, does that work for gfortran?
Or if you don't want to pass all the Fortran source files on one giant command line, there's -flto link-time optimization which uses a linker "plugin" to run the optimizer on GIMPLE stored in .o files (instead of or as well as machine code).
LTO means you should pass all your optimization options to the invocation of gfortran that does the linking, as well as the gfortran -c that compiles to .o.
So you might use gfortran -ffast-math -O3 -march=native -flto to compile and link, assuming gfortran supports the same options as gcc. (And that -march=native is what you want: make an executable optimized for the computer you compiled on, which might SIGILL on other computers without all the ISA extensions this one supports.)
I recently updated my Linux laptop from Ubuntu 16.04 to 18.04.
I had a STM32 (Cortex-M4) Makefile based project that compiled correctly with the arm-none-eabi g++ version provided by Ubuntu. The generated file required 47620 bytes in the .text section.
With the Ubuntu upgrade, I have also installed an up-to-date version of gcc (from ARM website). Version is 8.2.1.
When I compile the same project (make clean && make), the generated binary does not fit in flash (97424 bytes required, more than twice as much!). The project is exactly the same (sources, link script, startup files, Makefile).
The compiler options are: -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -DSTM32F303x8 -DARMCM4 -O0 -g -Wall -fexceptions -Wno-deprecated.
The linker options are -mthumb -mcpu=cortex-m4 -Tstm32f303K8.ld -mfloat-abi=hard -mfpu=fpv4-sp-d16 --specs=nosys.specs -lm -Wl,--start-group -lm -Wl,--end-group -Wl,--gc-sections -Lsys -Xlinker -Map=test.elf.map
When I look at the generated .map file, all the user functions take approximately the same size (the new version saves 8 bytes!). But after them, it includes C++-specific parts, and one of them is more than 26 KB (from the map file):
.text 0x00000000080079e8 0x683c /usr/local/gcc-arm-none-eabi-8-2018-q4-major/bin/../lib/gcc/arm-none-eabi/8.2.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libstdc++.a(cp-demangle.o)
0x000000000800e13c __cxa_demangle
Note: there is no problem with C-only projects, only with C++. The libraries included are the same (gcc 4.9.3 -> armv7e-m/fpu, and gcc 8.2.1 -> thumb/v7e-m+fp/hard):
libm.a libstdc++.a libc.a libnosys.a libgcc.a
Is there a way to get rid of that so that I can compile and flash my (not so old) project?
regards,
I found a solution using libstdc++_nano (instead of the implicit libstdc++). With that, the code size is reduced from 84kb to 26kb!
LDFLAGS += -lstdc++_nano
It just works. Thanks #Henrik, #Matthieu and #EOF for your support!
It might be related to exception handling, as std::terminate(), which is used with exceptions, might call the demangling routine. If you don't need exceptions then try disabling them with -fno-exceptions as described here.
Another solution might be to look at the GCC headers:
Demangling routine.
ABI-mandated entry point in the C++ runtime library for demangling.
[...]
returns a pointer to the start of the NUL-terminated demangled
name, or NULL if the demangling fails. The caller is
responsible for deallocating this memory using free.
The prototype is:
char*
__cxa_demangle(const char* __mangled_name, char* __output_buffer,
size_t* __length, int* __status);
So you could probably just supply your own dummy function returning NULL (given that all library functions are weak and can be overridden). I'd advise you to look at the disassembled code first, though, and find out how and why it is being called in the first place, since simply discarding the functionality might change behaviour.
They also give other advice in this forum post, which might be useful for you as well:
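A minimal sketch of such a dummy, following the prototype quoted above. Whether the linker really prefers this definition over the one in libstdc++.a depends on the toolchain and on what else pulls in cp-demangle.o, so treat it as an assumption to verify in the map file. The status value -1 is the code the ABI uses for an allocation failure; choosing it here is an arbitrary decision of this sketch.

```cpp
#include <cstddef>

// Stub replacement for the ABI demangling entry point. It never demangles:
// it reports failure through *status and returns a null pointer, so callers
// fall back to printing the mangled name (or nothing at all).
extern "C" char* __cxa_demangle(const char* /*mangled_name*/,
                                char* /*output_buffer*/,
                                std::size_t* /*length*/,
                                int* status)
{
    if (status != nullptr) {
        *status = -1; // ABI status code for "memory allocation failure"
    }
    return nullptr;
}
```

After relinking, the map file should show whether cp-demangle.o is still being pulled in from libstdc++.a.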
Optimize for size with -Os instead of -O0 (possibly add the -Og option instead, if you prefer easily debuggable code, it is often both smaller and faster than -O0).
Optimize at link-time with -flto while compiling and linking.
Maybe disable RTTI if not used.
Alternate title: Why does my dylib include extra exported symbols when compiled by Xcode vs Makefile?
My company builds a C++ dynamic library (dylib) on the Mac using clang. We recently ported our hand-crafted Makefile to the CMake build system and are now using the generated Xcode projects. After ensuring that all the compiler/linker flags and environment variables matched exactly between the two systems, we noticed that the dylib created by CMake/Xcode was slightly larger. Closer examination showed that it contained some additional exported symbols (from templated functions that were never referenced and therefore should never have been instantiated; the specific templates had their definitions and specializations in the source files, as we use explicit instantiation frequently, although in this case they were not explicitly instantiated). Examining the disassembly of some of the object files showed slight instruction differences as well. The only thing that got the libraries to match exactly in size and symbols was to match the order of the compiler flags exactly. This appears to demonstrate some order-dependent interaction between compiler flags, which seems like a compiler bug or at least poorly documented behavior.
For this specific issue, these were the compiler invocations:
clang++ -fvisibility=hidden -fvisibility-ms-compat -c foo.cpp -o foo.o
clang++ -fvisibility-ms-compat -fvisibility=hidden -c foo.cpp -o foo.o
And this was the linker invocation:
clang++ -dynamiclib -o libfoo.dylib foo.o
Displaying the exported symbols with:
nm -g libfoo.dylib
showed the differences. I submitted this LLVM Bug.
Are there ever any valid situations where compiler flag ordering matters?
Microsoft's compilers and pretty much everyone else's have traditionally had very different models for symbol visibility in the object file. The former have long used C and C++ language extensions to control symbol emission by the compiler, and do not export symbols by default.
It seems likely that -fvisibility=hidden and -fvisibility-ms-compat are mutually exclusive, and that the compiler honours the last one seen on its command line.
In all fairness, there is little documentation for -fvisibility-ms-compat to be had - other than the commit adding it to clang.
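One way to make the exported set explicit, rather than dependent on which visibility flag wins, is to annotate the public symbols in the source and build with -fvisibility=hidden only. A minimal sketch, in which the macro FOO_API and both function names are hypothetical:

```cpp
// Export macro: the Microsoft-style language extension on Windows,
// the GCC/Clang visibility attribute everywhere else.
#if defined(_WIN32)
#  define FOO_API __declspec(dllexport)
#else
#  define FOO_API __attribute__((visibility("default")))
#endif

// Exported from the library even when compiled with -fvisibility=hidden.
FOO_API int foo_public(int x);

// Not annotated: hidden when the library is built with -fvisibility=hidden.
int foo_internal(int x);

int foo_internal(int x)
{
    return x + 1;
}

FOO_API int foo_public(int x)
{
    return foo_internal(x) * 2;
}
```

With this in place, nm -g on the resulting dylib should list foo_public but not foo_internal, which gives a quick way to compare two build systems.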
I've just ported a STM32 microcontroller project from Keil uVision (using Keil ARM Compiler) to CooCox CoIDE (using GCC ARM Embedded compiler).
The problem is that the code size doubles when compiled in CoIDE with GCC compared to Keil uVision.
How can this be? What can I do?
Code size in Keil: 54632b (.text)
Code size in CoIDE: 100844b (.text)
GCC compiler flags:
arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -g2 -Wl,-Map=project.map -Os -Wl,--gc-sections -Wl,-TC:\arm-gcc-link.ld -g -o project.elf -L -lm
I suspect that CoIDE and GCC compile a lot of functions and files that are present in the project but aren't used (yet). Is it possible that it compiles whole files even if I only use 1 function out of 20 in there (even though I have -Os)?
It is hard to say which files are really compiled and linked into your final binary from the information you give. I suppose it takes all the C files it finds in your project if you did not explicitly specify which ones to compile, or if you don't use your own Makefile.
But from the compiler options you give, the linker flag --gc-sections won't collect much garbage unless you also pass the compiler flags -ffunction-sections -fdata-sections. Try adding those options to strip all unused functions and data at link time.
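To illustrate why the two sets of flags belong together, here is a small sketch. The file name helpers.cpp and both functions are hypothetical; extern "C" is used only to keep the section names readable.

```cpp
// helpers.cpp (hypothetical file name)
//
// Compiled with -ffunction-sections, each function below is placed in its
// own section (.text.used_helper and .text.unused_helper) instead of one
// shared .text section for the whole file. The linker can then discard
// .text.unused_helper under -Wl,--gc-sections if nothing references it.
// Without -ffunction-sections both functions share one section, which is
// kept as a whole as soon as either function is referenced.

extern "C" int used_helper(int x)
{
    return x * 2;
}

extern "C" int unused_helper(int x)
{
    return x * 3;
}
```

The linker map file (already requested via -Wl,-Map=project.map) shows which input sections were kept, so it can be used to confirm what was actually dropped.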
Since the question was tagged with C++, I wonder if you would like to disable exceptions and RTTI. Those take quite a bit of code. Add -fno-exceptions -fno-rtti to the compiler flags.
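Code built with -fno-exceptions -fno-rtti cannot use throw, try/catch, dynamic_cast or typeid, so errors have to be reported another way. A minimal sketch of one common pattern, a status enum plus an output parameter; ParseStatus and parse_u32 are hypothetical names:

```cpp
#include <cstdint>

// Result codes returned instead of throwing an exception.
enum class ParseStatus { Ok, Empty, NotADigit };

// Parses a decimal string into out. Overflow is not checked in this sketch;
// unsigned arithmetic simply wraps around.
ParseStatus parse_u32(const char* text, std::uint32_t& out)
{
    if (text == nullptr || *text == '\0') {
        return ParseStatus::Empty;
    }
    std::uint32_t value = 0;
    for (const char* p = text; *p != '\0'; ++p) {
        if (*p < '0' || *p > '9') {
            return ParseStatus::NotADigit; // out is left untouched on failure
        }
        value = value * 10u + static_cast<std::uint32_t>(*p - '0');
    }
    out = value;
    return ParseStatus::Ok;
}
```

Callers check the returned status before using the output value, which keeps the error path explicit without any unwinding tables or type information in the binary.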
I'm playing around with a toolchain that seems to wrap gcc (qcc), but also uses g++ for a few things. This caused a bit of confusion when I couldn't link libs I built with g++ using g(q)cc even though it was for the same architecture (due to missing lib errors). After a bit more research, I found that g++ is basically gcc with a few default flags and a slightly different interpretation mechanism for file extensions (there may be other differences I've glanced over). I'd like to know exactly which flags can be passed to gcc to amount to the equivalent g++ call. For instance:
g++ -g -c hello.cpp // I know at the very least that this links in stl
gcc -g -c -??? // I want the exact same result as I got with g++... what flags do I use?
The way the tool chain is set up makes it sort of difficult to simply replace the gcc calls with g++. It'd be much easier to know which flags I need to pass.
The difference between using gcc and g++ to compile C++ code is that (a) g++ compiles files with the .c, .h, and .i extensions as C++ instead of C, and (b) it automatically links with the C++ standard library (-lstdc++). See the man page.
So assuming that you're not compiling .c, .h, or .i files as C++, all you need to do to make gcc act like g++ is add the -lstdc++ command line option to your linker flags. If you are compiling those other files as C++, you can add -x c++, but I'd advise you instead to rename them to use .cc or .ii extensions (.h can stay that way, if you're using precompiled headers).
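As a small illustration, the translation unit below depends on the C++ standard library, so linking it through the gcc driver needs -lstdc++. The function greeting is hypothetical; hello.cpp is the file name from the question.

```cpp
// hello.cpp
//
// Compiling works with either driver, because the .cpp extension already
// selects the C++ front end:
//   gcc -g -c hello.cpp
//   g++ -g -c hello.cpp
//
// The difference shows up at link time: when objects like this one are
// linked through gcc, -lstdc++ has to be added by hand, otherwise the
// std::string symbols stay unresolved.

#include <string>

std::string greeting()
{
    return std::string("hello");
}
```

On some targets g++ adds further libraries by default, so passing -v to both drivers and comparing the resulting linker command lines is the reliable way to see the full difference.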