Switching code using the preprocessor is pretty common:
#define MY_SWITCH (1)
#if MY_SWITCH
cout << "on" << Test(1);
#else
cout << "off" << Test(2);
#endif
However, if the code outside this snippet changes (e.g. if the Test() function is renamed), the disabled line can silently become outdated, since it is never compiled.
I would like to do this using a different kind of switch, so that both branches are compiled on every build and I can find outdated lines immediately. E.g. like this:
static const bool mySwitch = true;

if (mySwitch)
{
    cout << "on" << Test(1);
}
else
{
    cout << "off" << Test(2);
}
However, I need to prevent this method from consuming additional resources. Is there any guarantee (or a reliable assumption) that modern C++ compilers will remove the inactive branch (e.g. via optimization)?
I had this exact problem just a few weeks ago — disabling a problematic diagnostic feature in my codebase revealed that the alternative code had some newish bugs in it that prevented compilation. However, I wouldn't go down the route you propose.
You're sacrificing the benefit of using macros in the first place and not necessarily gaining anything. I expect my compiler to optimise the dead branch away, but you can't rely on it, and I feel that the macro approach makes it much more obvious that there are two distinct "configurations" of your program, only one of which can ever be used within a particular build.
I would let your continuous integration system (or whatever is driving automated build tests) cycle through the various combinations of build configuration (provide macros using -D on the commandline, possibly from within your Makefile or other build script, rather than hardcoding them in the source) and test them all.
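For example, the CI matrix could build once with g++ -DMY_SWITCH=1 and once with g++ -DMY_SWITCH=0 and run the test suite against both binaries; a renamed Test() then breaks one of the configurations at compile time instead of lurking in a disabled branch.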
You don't have a guarantee about compiler optimization. (If you want formally proven optimizations for C, look into CompCert.)
However, most compilers would optimize in that case, and some might even warn about dead code. Try with recent GCC or Clang/LLVM with optimizations enabled (e.g. g++ -Wall -Wextra -O2).
Also, with most compilers the generated optimized code won't consume additional resources at execution time, but the dead branch does cost resources at compilation time.
Perhaps using constexpr might help some compilers to optimize better.
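For example, a minimal sketch of the questioner's switch using constexpr (Test() is stubbed out here only to make the snippet self-contained):

#include <iostream>

int Test(int n) { return n; }    // stand-in for the questioner's Test()

constexpr bool mySwitch = true;  // constexpr instead of static const

int main() {
    if (mySwitch)
        std::cout << "on" << Test(1) << '\n';
    else
        std::cout << "off" << Test(2) << '\n'; // dead, but still compiled on every build
}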
Also, look at the produced assembly code (e.g. with g++ -O2 -fverbose-asm -S) or at the intermediate dumps of the compiler (e.g. g++ -O2 -fdump-tree-all which gives hundreds of dump files). If using GCC, you might customize it with MELT to e.g. add additional compile-time checks.
Related
tl;dr: Can it be ensured somehow (e.g. by writing a unit test) that some things are optimized away, e.g. whole loops?
The usual approach to be sure that something is not included in the production build is wrapping it with #if...#endif. But I prefer to stay with C++ mechanics instead. Even there, instead of complicated template specializations I like to keep implementations simple and argue "hey, the compiler will optimize this out anyway".
Context is embedded software in automotive (binary size matters), often with poor compilers. They are certified in the sense of safety, but usually not good at optimizations.
Example 1: In a container the destruction of elements is typically a loop:
for(size_t i = 0; i < elements; i++)
    buffer[i].~T();
This also works for built-in types such as int, as the standard allows the explicit call of the destructor even for scalar types (C++11 12.4-15). In that case the loop does nothing and should be optimized out. GCC does so, but another compiler (for Aurix) did not: I saw a literally empty loop in the disassembly! That needed a template specialization to fix.
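A sketch of one possible C++11 fix, dispatching on std::is_trivially_destructible so the empty loop is never emitted in the first place (destroyAll is a made-up helper name):

#include <cstddef>
#include <type_traits>

// The destructor loop only exists for types that actually need it.
template <typename T>
typename std::enable_if<!std::is_trivially_destructible<T>::value>::type
destroyAll(T* buffer, std::size_t elements) {
    for (std::size_t i = 0; i < elements; i++)
        buffer[i].~T();
}

// For int and friends this overload compiles to nothing at all.
template <typename T>
typename std::enable_if<std::is_trivially_destructible<T>::value>::type
destroyAll(T*, std::size_t) {}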
Example 2: Code which is intended only for debugging, profiling, fault-injection etc.:
constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if( isDebugging ) {
        // Albeit 'dead' section, it may not appear in production binary!
        // (size, security, safety...)
        // 'if constexpr..' not an option (C++11)
        std::cout << "Arg was " << arg << std::endl;
    }
    // normal code here...
}
I can look at the disassembly, sure. But being upstream platform software, it's hard to control all the targets, compilers and options one might use downstream. The fear is that, for whatever reason, a downstream project ends up with code bloat or a performance issue.
Bottom line: Is it possible to write the software in a way that certain code is known to be optimized away, as reliably as #if would do? Or a unit test which fails if the optimization is not as expected?
[Timing tests come to my mind for the first problem, but being bare-metal I don't have convenient tools yet.]
There may be a more elegant way, and it's not a unit test, but if you're just looking for that particular string, and you can make it unique,
strings $COMPILED_BINARY | grep "Arg was"
should show you whether the string is included in the binary.
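If the message itself isn't unique enough, a sketch of one way to tag it (DBGMARKER_ is an arbitrary, made-up prefix):

#include <iostream>

constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if (isDebugging) {
        // the marker text should never survive into a release binary
        std::cout << "DBGMARKER_foo Arg was " << arg << std::endl;
    }
}

Then strings $COMPILED_BINARY | grep DBGMARKER_ checks every tagged block at once.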
if constexpr is the canonical C++ construct (since C++17) for this kind of test.
#include <iostream>

constexpr bool DEBUG = /*...*/;

int main() {
    if constexpr (DEBUG) {
        std::cerr << "We are in debugging mode!" << std::endl;
    }
}
If DEBUG is false, then no code is generated for the console print at all. So if you have things like log statements that you need for checking the behavior of your code, but which you don't want to interact with in production, you can hide them inside if constexpr blocks to eliminate the code entirely once it is moved to production.
Looking at your question, I see several (sub-)questions in it that require an answer. Not all answers might be possible with your bare-metal compilers as hardware vendors don't care that much about C++.
The first question is: how do I write code in a way that I'm sure it gets optimized? The obvious answer here is to put everything in a single compilation unit so the caller can see the implementation.
The second question is: how can I force a compiler to optimize? Here constexpr is a blessing. Depending on whether you have support for C++11, C++14, C++17 or even the upcoming C++20, you'll get different feature sets of what you can do in a constexpr function. For example:
constexpr char c = std::string_view{"my_very_long_string"}[7];
With the code above, c is defined as a constexpr variable. Because you apply constexpr to the variable, you require several things:
Your compiler should optimize the code so the value of c is known at compile time. This even holds true for -O0 builds!
All functions used to calculate c are constexpr and available (and as a result, this enforces the behaviour asked about in the first question).
No undefined behaviour is allowed to be triggered in the calculation of c (for the given input).
The downside of this is: your input needs to be known at compile time.
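You can even turn the expectation into a build-time check with static_assert; a sketch (index 7 of this particular literal happens to be an underscore):

#include <string_view>

constexpr char c = std::string_view{"my_very_long_string"}[7];
static_assert(c == '_', "evaluated entirely at compile time");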
C++17 also provides if constexpr, which has similar requirements: the condition needs to be calculable at compile time. The result is that the discarded branch is not instantiated (it may even contain constructs that don't work for the type you are using).
Which then brings us to the question: how do I ensure sufficient optimizations for my program to run fast enough, even if my compiler isn't well behaved? Here the only relevant answer is: create benchmarks and compare the results. Take the effort to set up a CI job that automates this for you. (And yes, you can even use external hardware, although that isn't easy.) In the end, you have some requirements: handling A should take less than X seconds. Do A several times and time it. Even if the benchmarks don't cover everything, as long as you stay within the requirements, it's fine.
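A minimal timing sketch of that idea, where handleA() is a stand-in for the operation under test:

#include <chrono>
#include <cstdio>

void handleA() { /* the operation under test */ }

bool meetsRequirement(int repetitions, double maxSeconds) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        handleA();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    std::printf("%d runs took %f s\n", repetitions, elapsed.count());
    return elapsed.count() < maxSeconds;
}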
Note: as this is about debug code, you most likely want to track the size of the executable as well. As soon as you start using streams, you pull in a lot of conversions to string and your exe size will grow. (And you'll find this a blessing, as you will immediately spot commits which add 10% to the image size.)
And then the final question: you have a buggy compiler and it doesn't meet your requirements. Here the only answer is: replace it. In the end, you can use any compiler to compile your code to bare metal, as long as the linker scripts work. If you need a start, C++Now 2018: Michael Caisse "Modern C++ in Embedded Systems" gives you a very good idea of what you need in order to switch to a different compiler (like a recent Clang or GCC, against which you can even file bugs if the optimization isn't good enough).
Insert a reference to external data or a function into the block that should be verified to be optimised away, like this:
extern void nop();

constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if( isDebugging ) {
        nop();
        std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
    }
    // normal code here...
}
In debug builds, link with an implementation of nop() in an extra compilation unit nop.cpp:
void nop() {}
In release builds, don't provide an implementation. Release builds will then only link if the optimisable code has been eliminated.
Here's another nice solution using inline assembly.
This uses assembler directives only, so it might even be kind of portable (checked with clang).
constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if( isDebugging ) {
        asm(".globl _marker\n_marker:\n");
        std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
    }
    // normal code here...
}
This would leave an exported linker symbol in the compiled executable, if the code isn't optimised away. You can check for this symbol using nm(1).
clang can even stop the compilation right away:
constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if( isDebugging ) {
        asm("_marker=1\n");
        std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
    }
    asm volatile (
        ".ifdef _marker\n"
        ".err \"code not optimised away\"\n"
        ".endif\n"
    );
    // normal code here...
}
This is not an answer to "How to ensure some code is optimized away?" but to your summary line "Can a unit test be written that e.g. whole loops are optimized away?".
First, the answer depends on how far you see the scope of unit-testing - so if you put in performance tests, you might have a chance.
If in contrast you understand unit-testing as a way to test the functional behaviour of the code, you don't. For one thing, optimizations (if the compiler works correctly) shall not change the behaviour of standard-conforming code.
With incorrect code (code that has undefined behaviour), optimizers can do what they want. (Well, for code with undefined behaviour the compiler can do so in the non-optimizing case too, but sometimes only the deeper analyses performed during optimization make it possible for the compiler to detect that some code has undefined behaviour.) Thus, if you write unit-tests for some piece of code with undefined behaviour, the test results may differ when you run the tests with and without optimization. But, strictly speaking, this only tells you that the compiler translated the code in two different ways - it does not guarantee that the code is optimized in the way you want it to be.
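A classic illustration: signed overflow is undefined behaviour, so the observed result may legitimately differ between -O0 and -O2, and neither result proves anything about the generated code:

#include <climits>
#include <iostream>

bool wraps(int x) { return x + 1 < x; }   // UB when x == INT_MAX

int main() {
    // An optimizer may fold x + 1 < x to false; without optimization,
    // hardware wrap-around typically makes this print 1.
    std::cout << wraps(INT_MAX) << '\n';
}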
Here's another way that also covers the first example.
You can verify (at runtime) that the code has been eliminated, by comparing two labels placed around it.
This relies on the GCC extension "Labels as Values" https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
before:
    for(size_t i = 0; i < elements; i++)
        buffer[i].~T();
behind:
    if (intptr_t(&&behind) != intptr_t(&&before)) abort();
It would be nice if you could check this in a static_assert(), but sadly the difference of two &&label expressions is not accepted as a compile-time constant.
GCC insists on inserting a runtime comparison, even though both labels are in fact at the same address.
Interestingly, if you compare the addresses (type void*) directly, without casting them to intptr_t, GCC wrongly folds the if() to "always true", whereas clang correctly removes the complete if() as "always false", even at -O1.
I have a question about inline functions in C++. I know that similar questions have appeared many times on this site. I hope that mine is a little bit different.
I know that when you specify some function to be inline, it is just a "suggestion" to the compiler. So in this case:
inline int func1()
{
    return 2;
}
Some code later
cout << func1() << endl; // replaced by cout << 2 << endl;
So there is no mystery there, but what about cases like this:
inline int func1()
{
    return 2;
}

inline int func2()
{
    return func1() * 2;
}

inline int func3()
{
    return func2() * func1() * 2;
}
And so on...
Which of these functions have a chance to become inlined, is it beneficial, and how to check what the compiler actually did?
Which of these functions have a chance to become inlined
Any and all functions have a chance to become inlined, if the tool(1) doing the inlining has access to the function's definition (= body) ...
is it beneficial
... and deems it beneficial to do so. Nowadays, it's the job of the optimiser to determine where inlining makes sense, and for 99.9% of programs, the best the programmer can do is stay out of the optimiser's way. The remaining few cases are programs like Facebook, where 0.3% of performance loss is a huge regression. In such cases, manual tweaking of optimisations (along with profiling, profiling, and profiling) is the way to go.
how to check what the compiler actually did
By inspecting the generated assembly. Every compiler has a flag to make it output assembly in "human-readable" format instead of (or in addition to) object files in binary form.
(1) Normally, this tool is the compiler and inlining happens as part of the compilation step (turning source code into assembly/object files). That is also the only reason why you may be required to use the inline keyword to actually allow a compiler to inline: because the function's definition must be visible in the translation unit (= source file) being compiled, and quite often that means putting the function definition into a header file. Without inline, this would then lead to multiple-definition errors if the header file was included in more than one translation unit.
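A sketch of that situation (utils.h is a made-up name): the definition lives in a header included by many translation units, and inline is what prevents multiple-definition errors while keeping the body visible for inlining everywhere:

// utils.h
#ifndef UTILS_H
#define UTILS_H

inline int twice(int x) { return 2 * x; }  // definition visible to every includer

#endif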
Note that compilation is not the only stage when inlining is possible. When you enable Whole-Program Optimisation (also known as Link-Time Code Generation), one more pass of optimisation happens at link time, once all object files are created. At this point, the inline keyword is totally irrelevant, since linking has access to all the function definitions (the binary wouldn't link successfully otherwise). This is therefore the way to get the most benefit from inlining without having to think about it at all when writing code. The downside is time: WPO takes time to run, and for large projects can prolong link times to unacceptable levels (I've personally experienced a somewhat pathological case where enabling WPO took a program's link time from 7 minutes to 46).
Think of inline as only a hint to the compiler, a bit like register was in old versions of the C++ and C standards. Caveat: register has been deprecated and is removed in C++17.
Which of these functions have a chance to become inlined, is it beneficial
Trust your compiler to make sane inlining decisions. To inline a particular call, the compiler needs to know the body of the called function. You should not care whether the compiler inlines it or not (in theory).
In practice, with the GCC compiler:
inlining is not always improving the performance (e.g. because of CPU cache issues, TLB, branch predictor, etc etc....)
inlining decisions depend a lot on optimization options. Inlining is more likely to happen with -O3 than with -O1; there are many guru options (like -finline-limit= and others) to tune it.
notice that individual calls get inlined or not. It is quite possible that some call occurrence like foo(x) at line 123 is inlined, but another call occurrence (to the same function foo) like foo(y) at some other place like line 456 is not inlined.
when debugging, you may want to disable inlining (because that makes the debugging more convenient). This is possible with the -fno-inline GCC optimization flag (which I often use with -g, which asks for debugging information).
the always_inline function attribute "forces" inlining, and noinline prevents it (see the sketch after this list).
if you compile and link with link-time optimization (LTO) such as -flto -O2 (or -flto -O3), e.g. with CXX=g++ -flto -O2 in your Makefile, inlining can happen between several translation units (e.g. C++ source files). However, LTO at least doubles the compilation time (and often worse) and consumes memory during compilation (so you'd better have a lot of RAM), and often improves performance by only a few percent (with weird exceptions to this rule of thumb).
you might optimize a function differently, e.g. with #pragma GCC optimize ("-O3") or with function attribute optimize
Look also into profile-guided optimizations with instrumentation options like -fprofile-generate and latter optimizations with -fprofile-use with other optimization flags.
If you are curious about what calls are inlined (and sometimes, some won't be) look into the generated assembler (e.g. use g++ -O2 -S -fverbose-asm and look in the .s assembler file), or use some internal dump options.
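A sketch of the attributes mentioned above (GCC/Clang syntax; the function names are arbitrary):

// Forced inlining; GCC wants the inline keyword alongside always_inline.
__attribute__((always_inline)) inline int fastPath(int x) { return x + 1; }

// Never inlined, even at -O3; handy for profiling or stable breakpoints.
__attribute__((noinline)) int slowPath(int x) { return x - 1; }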
The observable behavior of your code (apart from performance) should not depend upon inlining decisions made by your compiler. In other words, don't expect inlining to happen (or not to happen). If your code behaves differently with or without some optimization, it is likely to be buggy. So read about undefined behavior.
See also MILEPOST GCC project (using machine learning techniques for optimization purposes).
We have a large binary compiled with the -g and -O compiler flags. The issue is that setting a breakpoint at some file/line does not break at that file/line, or breaks at some other line, while debugging with gdb. I understand that this could be due to the -O flag (used for optimization). Unfortunately, I am not in a position to remove the -O flag, as there are many scripts at various levels that I would need to take care of.
How can I make the code break at the file/line I want? Is there a line of code I can add that will never be optimized away and will always break when debugging with gdb? I tried something like this:
int x;
int y;
But even then the gdb breakpoint did not work properly. How can I set it correctly?
There are two problems I can think of: inlining and optimisation. Since there is no standard way to tell the compiler to disable inlining and/or optimisation, you'll only be able to do it in a compiler-specific way.
To disable inlining in GCC, you can use __attribute__((noinline)) on the method.
To prevent the compiler from optimising a function away (and, untested, to give you a stable line of code where you can set your breakpoint), just add this to the code:
asm ("");
Both of these are documented at this page.
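A sketch combining both, assuming GCC (debugAnchor is a made-up name): call it at the spot of interest and set the breakpoint on the function instead of a file/line pair.

// Never inlined, and the empty asm gives the optimiser a statement
// it cannot prove removable - a stable target for "break debugAnchor".
__attribute__((noinline)) void debugAnchor() {
    asm ("");
}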
I have always been told that compilers are sufficiently smart to eliminate dead code. Much of the code that I write has a lot of information known at compile time, but the code has to be written in its most generic form. I don't know any assembly, so I cannot examine the generated assembly. What kind of code can be effectively eliminated in the final executable?
A few examples (but not limited to these):
void f(bool b){
    if(b){
        //some code
    }else{
        //some code
    }
}
f(true);
//////////////////////////
template<bool b>
void f(){
    if(b){
        //some code
    }else{
        //some code
    }
}
f<true>();
///////////////////////////
What if the definition of f is in a different object file and the call f(true) is in main? Will link-time optimisation effectively eliminate the dead code? What coding style/compiler options/tricks facilitate dead code elimination?
Typically, if you're compiling with the -O flag, the following flags are turned on:
-fauto-inc-dec
-fcompare-elim
-fcprop-registers
-fdce
[...]
-fdce stands for Dead Code Elimination. I'd suggest you compile your binaries with and without this option (i.e. by turning it off explicitly) to make sure your binaries are as optimized as you'd like them to be.
Read about the different passes of the compiler:
SSA Aggressive Dead Code Elimination. Turned on by the `-fssa-dce' option. This pass performs elimination of code considered unnecessary because it has no externally visible effects on the program. It operates in linear time.
As for helping the linker with dead code elimination, go through this presentation. Two major takeaways:
Compile your modules with -ffunction-sections -fdata-sections – there are no downsides to it! This includes static libraries, not just binaries – make it possible for users of your library to benefit from more efficient dead code removal.
Link your binaries with --gc-sections, unless you have to link against a nasty third-party static library which uses magic sections.
You may also want to take a look at this GCC bug (to see what chances of optimization may be missed and why).
Your example focuses on dead code elimination inside functions.
Another type of dead code elimination is the removal of entire unused symbols (functions or variables), which can be achieved with:
-fdata-sections -ffunction-sections -Wl,--gc-sections
as mentioned at: How to remove unused C/C++ symbols with GCC and ld?
These flags are not enabled by default at the various GCC -O levels (-O1, -O2, etc).
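A sketch to see the effect (file name and commands are only illustrative):

// main.cpp
int used() { return 1; }
int unused() { return 2; }   // never referenced anywhere

int main() { return used(); }

// g++ -O1 -ffunction-sections -Wl,--gc-sections main.cpp -o demo
// nm demo | grep unused     (should print nothing: the section was collected)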
When I used a template parameter constant in such an if expression, the DCE (Dead Code Elimination) compiler flags (GCC 4.8.1 on Linux) did not help, and -O2/-O3 optimization did not help either.
I had to use a template specialization wrapper:
template<bool b>
void f();

template<>
void f<true>(){
    //some code on true condition
}

template<>
void f<false>(){
    //some code on false condition
}
Macros can also be used to avoid compiling the unused code branch at all. Note, however, that the preprocessor runs before template instantiation, so a template parameter is not visible to #if; the condition has to be an actual preprocessor macro:
#define B 1 // must be an actual preprocessor macro, not a template parameter

void f(){
#if B
    //some code
#else
    //some code
#endif // B
}
Recently, a coworker pointed out to me that compiling everything into a single file created much more efficient code than compiling separate object files - even with link time optimization turned on. In addition, the total compile time for the project went down significantly. Given that one of the primary reasons for using C++ is code efficiency, this was surprising to me.
Clearly, when the archiver/linker makes a library out of object files, or links them into an executable, even simple optimizations are penalized. In the example below, trivial inlining costs 1.8% in performance when done by the linker instead of the compiler. It seems like compiler technology should be advanced enough to handle fairly common situations like this, but it isn't happening.
Here is a simple example using Visual Studio 2008:
#include <cstdlib>
#include <iostream>
#include <boost/timer.hpp>
using namespace std;
int foo(int x);
int foo2(int x) { return x++; }
int main(int argc, char** argv)
{
    boost::timer t;
    t.restart();
    for (int i = 0; i < atoi(argv[1]); i++)
        foo (i);
    cout << "time : " << t.elapsed() << endl;

    t.restart();
    for (int i = 0; i < atoi(argv[1]); i++)
        foo2 (i);
    cout << "time : " << t.elapsed() << endl;
}
foo.cpp
int foo (int x) { return x++; }
Results of the run: a 1.8% performance hit when using the linked foo instead of the inlined foo2.
$ ./release/testlink.exe 100000000
time : 13.375
time : 13.14
And yes, the linker optimization flags (/LTCG) are on.
Your coworker is out of date. The technology has been here since 2003 (on the MS C++ compiler): /LTCG. Link-time code generation deals with exactly this problem. From what I know, GCC has this feature on the radar for the next-generation compiler.
LTCG not only optimizes the code (e.g. inlining functions across modules) but actually rearranges code to optimize cache locality and branching for a specific load; see Profile-Guided Optimizations. These options are usually reserved for release builds, as the build can take hours to finish: it links an instrumented executable, runs a profiling load, and then links again with the profiling results. The link contains details about what exactly is optimized with LTCG:
Inlining – For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.

Virtual Call Speculation – If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.

Register Allocation – Optimizing with profile data results in better register allocation.

Basic Block Optimization – Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.

Size/Speed Optimization – Functions where the program spends a lot of time can be optimized for speed.

Function Layout – Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.

Conditional Branch Optimization – With the value probes, profile-guided optimizations can find if a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions where the optimizer can order the if/else so that either the if or else block is placed first depending on which block is more frequently true.

Dead Code Separation – Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.

EH Code Separation – The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only on exceptional conditions.

Memory Intrinsics – The expansion of intrinsics can be decided better if it can be determined if an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.
I'm not a compiler specialist, but I think the compiler has much more information at its disposal to optimize, as it operates on a language tree, as opposed to the linker, which has to content itself with operating on the object output, far less expressive than the code the compiler has seen. Hence less effort is spent by the linker and compiler development teams on making linker optimizations that could match, in theory, the tricks the compiler does.
BTW, I'm sorry I steered your original question into the LTCG discussion. I now understand your question was a little bit different, more concerned with the link-time vs. compile-time static optimizations possible/available.
Your coworker is smarter than most of us. Even if it seems a crude approach at first, project inlining into a single .cpp file has one thing that the other approaches, like link-time optimization, do not have and will not have for a while: reliability.
However, you asked this two years ago, and I can testify that a lot has changed since then (with g++ at least). Devirtualization is a lot more reliable, for instance.