I have always been told that the compiler is smart enough to eliminate dead code. Much of the code I write has a lot of information known at compile time, but the code has to be written in its most generic form. I don't know any assembly, so I cannot examine the generated assembly. What kinds of code can be effectively eliminated in the final executable?
A few examples (but not limited to these):
void f(bool b) {
    if (b) {
        // some code
    } else {
        // some code
    }
}
f(true);
//////////////////////////
template<bool b>
void f() {
    if (b) {
        // some code
    } else {
        // some code
    }
}
f<true>();
///////////////////////////
What if the definition of f is in another object file and the call f(true) is in main? Will link-time optimisation effectively eliminate the dead code? What coding style / compiler option / trick facilitates dead code elimination?
Typically, if you're compiling with the -O flag, the following flags are turned on:
-fauto-inc-dec
-fcompare-elim
-fcprop-registers
-fdce
[...]
-fdce stands for Dead Code Elimination. I'd suggest you compile your binaries with and without this option (i.e. by turning it off explicitly) to check whether your binaries are as optimized as you'd like them to be.
Read about the different passes of the compiler:
SSA Aggressive Dead Code Elimination. Turned on by the `-fssa-dce'
option. This pass performs elimination of code considered unnecessary
because it has no externally visible effects on the program. It
operates in linear time.
As for helping the linker with dead code elimination go through this presentation. Two major takeaways being:
Compile your modules with -ffunction-sections -fdata-sections – there
are no downsides to it!
This includes static libraries, not just
binaries – make it possible for users of your library to benefit from
more efficient dead code removal.
Link your binaries with
--gc-sections, unless you have to link against a nasty third-party static library that uses magic sections.
You may also want to take a look at this GCC bug (to see what chances of optimization may be missed and why).
Your example focuses on dead code elimination inside functions.
Another type of dead code elimination is the removal of entire unused symbols (functions or variables), which can be achieved with:
-fdata-sections -ffunction-sections -Wl,--gc-sections
as mentioned at: How to remove unused C/C++ symbols with GCC and ld?
These flags are not enabled in the various GCC -O levels (-O1, -O2, etc) by default.
When I used a template parameter constant in such an if expression, the dce (Dead Code Elimination) compiler flags (GCC 4.8.1 on Linux) did not help, and -O2/-O3 optimization did not help either.
I had to use template specialization wrapper:
template<bool b>
void f();

template<>
void f<true>() {
    // some code on true condition
}

template<>
void f<false>() {
    // some code on false condition
}
Macros, on the other hand, cannot switch on a template parameter: the preprocessor runs before template instantiation, so in the snippet below #if b tests a preprocessor macro named b (undefined here), not the template argument. (A bare #elif with no condition is also ill-formed; it would have to be #else.)

template<bool b>
void f() {
#if b        // tests a macro named b, NOT the template parameter
    // some code
#else
    // some code
#endif // b
}
Related
I have this piece of code:
constexpr static VOID fStart()
{
    auto a = 3;
    a++;
}

__declspec(naked)
constexpr static VOID fEnd() {};

static constexpr auto getFSize()
{
    return (SIZE_T)((PBYTE)fEnd - (PBYTE)fStart);
}

static constexpr auto fSize = getFSize();

static BYTE func[fSize];
Is it possible to declare the "func[fSize]" array size as the size of "fStart()" during compilation, without using any standard library facilities? It is necessary in order to copy the full code of fStart() into this array later.
There is no method in standard C++ to get the length of a function.
You'll need to use a compiler specific method.
One method is to have the linker create a segment, and place your function in that segment. Then use the length of the segment.
You may be able to use some assembly language constructs to do this; depends on the assembler and the assembly code.
Note: in embedded systems, there are reasons to move function code, such as to On-Chip memory or swap to external memory, or to perform a checksum on the code.
The following calculates the "byte size" of the fStart function. However, the size cannot be obtained as a constexpr this way, because casting loses the compile-time const'ness (see for example Why is reinterpret_cast not constexpr?), and the difference of two unrelated function pointers cannot be evaluated without some kind of casting.
#pragma runtime_checks("", off)
__declspec(code_seg("myFunc$a")) static void fStart()
{ auto a = 3; a++; }
__declspec(code_seg("myFunc$z")) static void fEnd(void)
{ }
#pragma runtime_checks("", restore)
constexpr auto pfnStart = fStart; // ok
constexpr auto pfnEnd = fEnd; // ok
// constexpr auto nStart = (INT_PTR)pfnStart; // error C2131
const auto fnSize = (INT_PTR)pfnEnd - (INT_PTR)pfnStart; // ok
// constexpr auto fnSize = (INT_PTR)pfnEnd - (INT_PTR)pfnStart; // error C2131
On some processors, and with some known compilers and ABI conventions, you could do the opposite:
generate machine code at runtime.
For x86-64 on Linux, I know of GNU lightning, asmjit, and libgccjit doing so.
The elf(5) format knows the size of functions.
On Linux, you can generate shared libraries (perhaps generating C or C++ code at runtime, as RefPerSys does and GCC MELT did, then compiling it with gcc -fPIC -shared -O) and later dlopen(3) / dlsym(3) them. dladdr(3) is also very useful. You'll use function pointers.
Read also a book on linkers and loaders.
But you usually cannot move machine code without doing some relocation, unless that machine code is position-independent code (quite often PIC is slower to run than ordinary code).
A related topic is garbage collection of code (or even of agents). You need to read the garbage collection handbook and take inspiration from implementations like SBCL.
Remember also that a good optimizing C++ compiler is allowed to unroll loops, inline expand function calls, remove dead code, do function cloning, etc... So it may happen that machine code functions are not even contiguous: two C functions foo() and bar() could share dozens of common machine instructions.
Read the Dragon book, and study the source code of GCC (and consider extending it with your GCC plugin). Look also into the assembler code produced by gcc -O2 -Wall -fverbose-asm -S. Some experimental variants of GCC might be able to generate OpenCL code running on your GPGPU (and then, the very notion of function end does not make sense)
With plugins generated through C and C++, you could carefully remove them using dlclose(3), if you use Ian Taylor's libbacktrace and dladdr to explore your call stack. In 99% of the cases it is not worth the trouble, since in practice a Linux process (on current x86-64 laptops in 2021) can do perhaps a million dlopen(3) calls, as my manydl.c program demonstrates (it generates "random" C code, compiles it into a unique /tmp/generated123.so, dlopens that, and repeats many times).
The only reason (on desktop and server computers) to overwrite machine code is for long-lasting server processes generating machine code every second. If that is your scenario, you should mention it (and generating JVM bytecode through Java classloaders could make more sense).
Of course, on 16 bits microcontrollers things are very different.
Is it possible to calculate function length at compile time in C++?
No, because at run time some functions do not exist anymore:
the compiler may have removed them, cloned them, or inlined them.
And for C++ this is practically important with standard containers: a lot of template expansion occurs, including for useless code which has to be removed by your optimizing compiler at some point.
(Think, in 2021, of compiling everything with a recent GCC 10.2 or 11 and linking with gcc -O3 -flto -fwhole-program: a function foo23 might be defined but never called, and then it is not inside the ELF executable.)
I have a question about inline functions in C++. I know that similar questions have appeared many times on this. I hope that mine is a little bit different.
I know that when you specify some function to be inline it is just a "suggestion" to the compiler. So in case:
inline int func1()
{
    return 2;
}
Some code later
cout << func1() << endl; // replaced by cout << 2 << endl;
So there is no mystery there, but what about cases like this:
inline int func1()
{
    return 2;
}

inline int func2()
{
    return func1() * 2;
}

inline int func3()
{
    return func2() * func1() * 2;
}
And so on...
Which of these functions have a chance to become inlined, is it beneficial, and how can I check what the compiler actually did?
Which of these functions have a chance to become inlined
Any and all functions have a chance to become inlined, if the tool(1) doing the inlining has access to the function's definition (= body) ...
is it beneficial
... and deems it beneficial to do so. Nowadays, it's the job of the optimiser to determine where inlining makes sense, and for 99.9% of programs, the best the programmer can do is stay out of the optimiser's way. The remaining few cases are programs like Facebook, where 0.3% of performance loss is a huge regression. In such cases, manual tweaking of optimisations (along with profiling, profiling, and profiling) is the way to go.
how to check what compiler actually did
By inspecting the generated assembly. Every compiler has a flag to make it output assembly in "human-readable" format instead of (or in addition to) object files in binary form.
(1) Normally, this tool is the compiler and inlining happens as part of the compilation step (turning source code into assembly/object files). That is also the only reason why you may be required to use the inline keyword to actually allow a compiler to inline: because the function's definition must be visible in the translation unit (= source file) being compiled, and quite often that means putting the function definition into a header file. Without inline, this would then lead to multiple-definition errors if the header file was included in more than one translation unit.
Note that compilation is not the only stage when inlining is possible. When you enable Whole-Program Optimisation (also known as Link-Time Code Generation), one more pass of optimisation happens at link time, once all object files are created. At this point, the inline keyword is totally irrelevant, since linking has access to all the function definitions (the binary wouldn't link successfully otherwise). This is therefore the way to get the most benefit from inlining without having to think about it at all when writing code. The downside is time: WPO takes time to run, and for large projcts, can prolong link times to unacceptable levels (I've personally experienced a somewhat pathological case where enabling WPO took a program's link time from 7 minutes to 46).
Think of inline as only a hint to the compiler, a bit like register was in old versions of C++ and C standards. Caveat, register is being obsoleted (in C++17).
Which of these functions have a chance to become inlined, is it benefitial
Trust your compiler on making sane inlining decisions. To enable some particular occurrence of a call, the compiler needs to know the body of the called function. You should not care if the compiler is inlining or not (in theory).
In practice, with the GCC compiler:
inlining is not always improving the performance (e.g. because of CPU cache issues, TLB, branch predictor, etc etc....)
inlining decisions depends a lot on optimization options. It probably is more likely to happen with -O3 than with -O1; there are many guru options (like -finline-limit= and others) to tune it.
notice that individual calls get inlined or not. It is quite possible that some call occurrence like foo(x) at line 123 is inlined, but another call occurrence (to the same function foo) like foo(y) at some other place like line 456 is not inlined.
when debugging, you may want to disable inlining (because that makes the debugging more convenient). This is possible with the -fno-inline GCC optimization flag (which I often use with -g, which asks for debugging information).
the always_inline function attribute "forces" inlining, and the noinline prevents it.
if you compile and link with link time optimization (LTO) such as -flto -O2 (or -flto -O3), e.g. with CXX=g++ -flto -O2 in your Makefile, inlining can happen between several translation units (e.g. C++ source files). However LTO is at least doubling the compilation time (and often, worse) and consumes memory during compilation (so better have a lot of RAM then), and often improve performance by only a few percents (with weird exceptions to this rule of thumb).
you might optimize a function differently, e.g. with #pragma GCC optimize ("-O3") or with function attribute optimize
Look also into profile-guided optimizations with instrumentation options like -fprofile-generate and latter optimizations with -fprofile-use with other optimization flags.
If you are curious about what calls are inlined (and sometimes, some won't be) look into the generated assembler (e.g. use g++ -O2 -S -fverbose-asm and look in the .s assembler file), or use some internal dump options.
The observable behavior of your code (except performance) should not depend upon inlining decisions made by your compiler. In other words, don't expect inlining to happen (or not). If your code behave differently with or without some optimization it is likely to be buggy. So read about undefined behavior.
See also MILEPOST GCC project (using machine learning techniques for optimization purposes).
Switching code using the preprocessor is pretty common:
#define MY_SWITCH (1)
#if MY_SWITCH
cout << "on" << Test(1);
#else
cout << "off" << Test(2);
#endif
However, if the code outside this snippet changes (e.g. if the Test() function is renamed), it can happen that the disabled line remains outdated, since it is not compiled.
I would like to do this using a different kind of switch to let the code being compiled on every build so I can find outdated lines immediately. E.g. like this:
static const bool mySwitch = true;
if (mySwitch)
{
    cout << "on" << Test(1);
}
else
{
    cout << "off" << Test(2);
}
However, I need to prevent this method from consuming additional resources. Is there any guarantee (or a reliable assumption) that modern C++ compilers will remove the inactive branch (e.g. using optimization)?
I had this exact problem just a few weeks ago: disabling a problematic diagnostic feature in my codebase revealed that the alternative code had some newish bugs in it that prevented compilation. However, I wouldn't go down the route you propose.
You're sacrificing the benefit of using macros in the first place and not necessarily gaining anything. I expect my compiler to optimise the dead branch away but you can't rely on it and I feel that the macro approach makes it a lot more obvious that there are two distinct "configurations" of your program and only one can ever be used from within a particular build.
I would let your continuous integration system (or whatever is driving automated build tests) cycle through the various combinations of build configuration (provide macros using -D on the commandline, possibly from within your Makefile or other build script, rather than hardcoding them in the source) and test them all.
You don't have a guarantee about compiler optimization. (If you want proven optimizations for C, look into CompCert.)
However, most compilers would optimize in that case, and some might even warn about dead code. Try with recent GCC or Clang/LLVM with optimizations enabled (e.g. g++ -Wall -Wextra -O2).
Also, I believe that most compilers won't consume resources at execution time of the generated optimized code, but they will consume resources at compilation time.
Perhaps using constexpr might help some compilers to optimize better.
Also, look at the produced assembly code (e.g. with g++ -O2 -fverbose-asm -S) or at the intermediate dumps of the compiler (e.g. g++ -O2 -fdump-tree-all which gives hundreds of dump files). If using GCC, you might customize it with MELT to e.g. add additional compile-time checks.
Will the C++ linker automatically inline "pass-through" functions, which are NOT defined in the header, and NOT explicitly requested to be "inlined" through the inline keyword?
For example, the following happens so often, and should always benefit from "inlining", that it seems every compiler vendor should have "automatically" handled it through "inlining" through the linker (in those cases where it is possible):
//FILE: MyA.hpp
class MyA
{
public:
    int foo(void) const;
};

//FILE: MyB.hpp
class MyB
{
private:
    MyA my_a_;
public:
    int foo(void) const;
};

//FILE: MyB.cpp
// PLEASE SAY THIS FUNCTION IS "INLINED" BY THE LINKER, EVEN THOUGH
// IT WAS NOT IMPLICITLY/EXPLICITLY REQUESTED TO BE "INLINED"?
int MyB::foo(void) const
{
    return my_a_.foo();
}
I'm aware the MSVS linker will perform some "inlining" through its Link Time Code Generation (LTCG), and that the GCC toolchain also supports Link Time Optimization (LTO) (see: Can the linker inline functions?).
Further, I'm aware that there are cases where this cannot be "inlined", such as when the implementation is not "available" to the linker (e.g., across shared library boundaries, where separate linking occurs).
However, if this code is linked into a single executable that does not cross DLL/shared-lib boundaries, I'd expect the compiler/linker to automatically inline the function as a simple and obvious optimization (benefiting both performance and size)?
Are my hopes too naive?
Here's a quick test of your example (with a MyA::foo() implementation that simply returns 42). All these tests were with 32-bit targets - it's possible that different results might be seen with 64-bit targets. It's also worth noting that using the -flto option (GCC) or the /GL option (MSVC) results in full optimization - wherever MyB::foo() is called, it's simply replaced with 42.
With GCC (MinGW 4.5.1):
gcc -g -O3 -o test.exe myb.cpp mya.cpp test.cpp
the call to MyB::foo() was not optimized away. MyB::foo() itself was slightly optimized to:
Dump of assembler code for function MyB::foo() const:
0x00401350 <+0>: push %ebp
0x00401351 <+1>: mov %esp,%ebp
0x00401353 <+3>: sub $0x8,%esp
=> 0x00401356 <+6>: leave
0x00401357 <+7>: jmp 0x401360 <MyA::foo() const>
That is, the entry prologue is left in place but immediately undone (the leave instruction), and the code jumps to MyA::foo() to do the real work. However, this is an optimization that the compiler (not the linker) is doing, since it realizes that MyB::foo() simply returns whatever MyA::foo() returns. I'm not sure why the prologue is left in.
MSVC 16 (from VS 2010) handled things a little differently:
MyB::foo() ended up as two jumps - one to a 'thunk' of some sort:
0:000> u myb!MyB::foo
myb!MyB::foo:
001a1030 e9d0ffffff jmp myb!ILT+0(?fooMyAQBEHXZ) (001a1005)
And the thunk simply jumped to MyA::foo():
myb!ILT+0(?fooMyAQBEHXZ):
001a1005 e936000000 jmp myb!MyA::foo (001a1040)
Again - this was largely (entirely?) performed by the compiler, since if you look at the object code produced before linking, MyB::foo() is compiled to a plain jump to MyA::foo().
So to boil all this down - it looks like without explicitly invoking LTO/LTCG, linkers today are unwilling/unable to perform the optimization of removing the call to MyB::foo() altogether, even if MyB::foo() is a simple jump to MyA::foo().
So I guess if you want link time optimization, use the -flto (for GCC) or /GL (for the MSVC compiler) and /LTCG (for the MSVC linker) options.
Is it common ? Yes, for mainstream compilers.
Is it automatic ? Generally not. MSVC requires the /GL switch, gcc and clang the -flto flag.
How does it work ? (gcc only)
The traditional linker used in the gcc toolchain is ld, and it's kind of dumb. Therefore, surprising as it may be, link-time optimization is not performed by the linker in the gcc toolchain.
Gcc has a specific intermediate representation on which the optimizations are performed that is language agnostic: GIMPLE. When compiling a source file with -flto (which activates the LTO), it saves the intermediate representation in a specific section of the object file.
When invoking the linker driver (note: NOT the linker directly) with -flto, the driver will read those specific sections, bundle them together into a big chunk, and feed this bundle to the compiler. The compiler reapplies the optimizations as it usually does for a regular compilation (constant propagation, inlining, and this may expose new opportunities for dead code elimination, loop transformations, etc...) and produces a single big object file.
This big object file is finally fed to the regular linker of the toolchain (probably ld, unless you're experimenting with gold), which performs its linker magic.
Clang works similarly, and I surmise that MSVC uses a similar trick.
It depends. Most compilers (linkers, really) support this kind of optimizations. But in order for it to be done, the entire code-generation phase pretty much has to be deferred to link-time. MSVC calls the option link-time code generation (LTCG), and it is by default enabled in release builds, IIRC.
GCC has a similar option, under a different name, but I can't remember which -O levels, if any, enables it, or if it has to be enabled explicitly.
However, "traditionally", C++ compilers have compiled a single translation unit in isolation, after which the linker has merely tied up the loose ends, ensuring that when translation unit A calls a function defined in translation unit B, the correct function address is looked up and inserted into the calling code.
If you follow this model, then it is impossible to inline functions defined in another translation unit.
It is not just some "simple" optimization that can be done "on the fly", like, say, loop unrolling. It requires the linker and compiler to cooperate, because the linker will have to take over some of the work normally done by the compiler.
Note that the compiler will gladly inline functions that are not marked with the inline keyword, but only if it is aware of how the function is defined at the site where it is called. If it can't see the definition, then it can't inline the call. That is why you normally define such small, trivial, "intended-to-be-inlined" functions in headers, making their definitions visible to all callers.
Inlining is not a linker function.
The toolchains that support whole program optimization (cross-TU inlining) do so by not actually compiling anything, just parsing and storing an intermediate representation of the code, at compile time. And then the linker invokes the compiler, which does the actual inlining.
This is not done by default, you have to request it explicitly with appropriate command-line options to the compiler and linker.
One reason it is not and should not be default, is that it increases dependency-based rebuild times dramatically (sometimes by several orders of magnitude, depending on code organization).
Yes, any decent compiler is fully capable of inlining that function if you have the proper optimisation flags set and the compiler deems it a performance bonus.
If you really want to know, add a breakpoint before your function is called, compile your program, and look at the assembly. It will be very clear if you do that.
The compiler must be able to see the body of the function for there to be any chance of inlining. The chance of this happening can be increased through the use of unity builds and LTCG.
The inline keyword only acts as guidance for the compiler to inline functions when optimizing. In g++, the optimization levels -O2 and -O3 enable different levels of inlining. The g++ documentation specifies the following: (i) if -O2 is specified, -finline-small-functions is turned on; (ii) if -O3 is specified, -finline-functions is turned on along with all the options for -O2; (iii) there is one more relevant option, -fno-default-inline, which makes member functions inline only if the inline keyword is added.
Typically, the size of the functions (number of instructions in the assembly), if recursive calls are used determine whether inlining happens. There are plenty more options defined in the link below for g++:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Please take a look and see which ones you are using, because ultimately the options you use determine whether your function is inlined.
Here is my understanding of what the compiler will do with functions:
If the function definition is inside the class definition, and assuming no scenarios exist which prevent inlining the function (such as recursion), the function will be inlined.
If the function definition is outside the class definition, the function will not be "inline-d" unless the function definition explicitly includes the inline keyword.
Here is an excerpt from Ivor Horton's Beginning Visual C++ 2010:
Inline Functions
With an inline function, the compiler tries to expand the code in the body of the function in place of a call to the function. This avoids much of the overhead of calling the function and, therefore, speeds up your code.
The compiler may not always be able to insert the code for a function inline (such as with recursive functions or functions for which you have obtained an address), but generally, it will work. It's best used for very short, simple functions, such as our Volume() in the CBox class, because such functions execute faster and inserting the body code does not significantly increase the size of the executable module.
With function definitions outside of the class definition, the compiler treats the functions as a normal function, and a call of the function will work in the usual way; however, it's also possible to tell the compiler that, if possible, you would like the function to be considered as inline. This is done by simply placing the keyword inline at the beginning of the function header. So, for this function, the definition would be as follows:
inline double CBox::Volume()
{
    return l * w * h;
}
I have been testing inline function calls in C++.
Thread model: win32
gcc version 4.3.3 (4.3.3-tdm-1 mingw32)
Stroustrup, in The C++ Programming Language, writes:
The inline specifier is a hint to the compiler that it should attempt to generate code [...] inline rather than laying down the code for the function once and then calling through the usual function call mechanism.
However, I have found that the generated code is simply not inlined: there is a CALL instruction for the isquare function.
Why is this happening? How can I use inline functions then?
EDIT: The command line options used:
**** Build of configuration Debug for project InlineCpp ****
**** Internal Builder is used for build ****
g++ -O0 -g3 -Wall -c -fmessage-length=0 -osrc\InlineCpp.o ..\src\InlineCpp.cpp
g++ -oInlineCpp.exe src\InlineCpp.o
Like Michael Kohne mentioned, the inline keyword is always a hint, and GCC in the case of your function decided not to inline it.
Since you are using GCC, you can force inlining with __attribute__((always_inline)).
Example:
/* Prototype. */
inline void foo (const char) __attribute__((always_inline));
Source:GCC inline docs
There is no generic C++ way to FORCE the compiler to create inline functions. Note the word 'hint' in the text you quoted - the compiler is not obliged to listen to you.
If you really, absolutely have to make something be in-line, you'll need a compiler specific keyword, OR you'll need to use macros instead of functions.
EDIT: njsf gives the proper gcc keyword in his response.
Are you looking at a debug build (optimizations disabled)? Compilers usually disable inlining in "debug" builds because they make debugging harder.
In any case, the inline specifier is indeed a hint. The compiler is not required to inline the function. There are a number of reasons why any compiler might decide to ignore an inline hint:
A compiler might be simple, and not support inlining
A compiler might use an internal algorithm to decide on what to inline and ignore the hints.
(sometimes, the compiler can do a better job than you can possibly do at choosing what to inline, especially in complex architectures like IA64)
A compiler might use its own heuristics to decide that despite the hint, inlining will not improve performance
Inline is nothing more than a suggestion to the compiler that, if it's possible to inline this function, the compiler should consider doing so. Some functions it will inline automatically because they are so simple, and other functions that you suggest it inline it won't, because they are too complex.
Also, I noticed that you are doing a debug build. I don't actually know, but it's possible that the compiler disables inlining for debug builds because it makes things difficult for the debugger...
It is a hint, and the compiler can choose to ignore the hint. I think I read somewhere that GCC generally ignores it. I remember hearing there was a flag, but it still does not work in 100% of cases. (I have not found a link yet.)
Flag: -finline-functions is turned on at -O3 optimisation level.
Whether to inline is up to the compiler. It is free to ignore the inline hint. Some compilers have a specific keyword (like __forceinline in VC++), but even with such a keyword, virtual calls to virtual member functions will not be inlined.
I faced similar problems and found that it only works if the inline function is defined in a header file.