How to ensure some code is optimized away? - c++

tl;dr: Can it be ensured somehow (e.g. by writing a unit test) that some things are optimized away, e.g. whole loops?
The usual approach to be sure that something is not included in the production build is to wrap it in #if...#endif. But I prefer to stay with C++ mechanisms instead. Even there, instead of complicated template specializations, I like to keep implementations simple and argue "hey, the compiler will optimize this out anyway".
The context is embedded automotive software (binary size matters), often with poor compilers. They are certified in the sense of safety, but usually not good at optimization.
Example 1: In a container the destruction of elements is typically a loop:
for(size_t i = 0; i < elements; i++)
    buffer[i].~T();
This also works for built-in types such as int, since the standard allows the explicit destructor call even for scalar types (C++11 12.4/15). In that case the loop does nothing and should be optimized out. In GCC it is, but in another compiler (Aurix) it was not: I saw a literally empty loop in the disassembly! So that needed a template specialization to fix it.
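For illustration, a sketch of such a specialization in C++11 (destroyAll is a name made up here; std::is_trivially_destructible requires a C++11 standard library):
#include <cstddef>
#include <type_traits>
// General case: run the destructor loop as before.
template <typename T>
typename std::enable_if<!std::is_trivially_destructible<T>::value>::type
destroyAll(T* buffer, std::size_t elements) {
    for (std::size_t i = 0; i < elements; i++)
        buffer[i].~T();
}
// Trivially destructible types (int etc.): no loop is emitted at all,
// independent of what the optimizer does.
template <typename T>
typename std::enable_if<std::is_trivially_destructible<T>::value>::type
destroyAll(T*, std::size_t) {}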
Example 2: Code that is intended only for debugging, profiling, fault injection, etc.:
constexpr bool isDebugging = false; // somehow a global flag
void foo(int arg) {
    if( isDebugging ) {
        // Although a 'dead' section, it must not appear in the production binary!
        // (size, security, safety...)
        // 'if constexpr..' not an option (C++11)
        std::cout << "Arg was " << arg << std::endl;
    }
    // normal code here...
}
I can look at the disassembly, sure. But as this is upstream platform software, it's hard to control all the targets, compilers and compiler options that might be used downstream. The fear is big that, for whatever reason, a downstream project ends up with code bloat or a performance issue.
Bottom line: Is it possible to write the software in a way that certain code is known to be optimized away, as safely as a #if would guarantee? Or to write a unit test that fails if the optimization does not happen as expected?
[Timing tests come to mind for the first problem, but being bare-metal I don't have convenient tools yet.]

There may be a more elegant way, and it's not a unit test, but if you're just looking for that particular string, and you can make it unique,
strings $COMPILED_BINARY | grep "Arg was"
should show you whether the string is included.

if constexpr is the canonical C++ construct (since C++17) for this kind of test.
constexpr bool DEBUG = /*...*/;
int main() {
    if constexpr(DEBUG) {
        std::cerr << "We are in debugging mode!" << std::endl;
    }
}
If DEBUG is false, then the code to print to the console won't be generated at all. So if you have things like log statements that you need for checking the behavior of your code, but which you don't want in production, you can hide them inside if constexpr blocks to eliminate the code entirely once it moves to production.
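For instance, a small logging helper along these lines keeps the call sites clean while still compiling the statements out (a sketch; debugLog is a made-up name):
#include <iostream>
constexpr bool DEBUG = false;
// Inside a template, the discarded branch of 'if constexpr' is not even
// instantiated, so nothing of it can survive into the binary.
template <typename T>
void debugLog(const T& value) {
    if constexpr (DEBUG) {
        std::cerr << value << std::endl;
    }
}
// usage: debugLog(42); // compiles to nothing when DEBUG is false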

Looking at your question, I see several (sub-)questions in it that require an answer. Not all answers might be possible with your bare-metal compilers, as hardware vendors don't care that much about C++.
The first question is: How do I write code in a way that I'm sure it gets optimized? The obvious answer here is to put everything in a single compilation unit so the caller can see the implementation.
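A sketch of the difference (helper names made up for illustration):
// helpers.h -- the implementation is visible to every caller and can be
// inlined without any link-time optimization:
inline bool isPowerOfTwo(unsigned x) { return x != 0 && (x & (x - 1)) == 0; }
// hidden.h -- only the declaration is visible; the implementation lives in
// another compilation unit, out of the optimizer's sight:
bool isPowerOfTwoHidden(unsigned x);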
The second question is: How can I force the compiler to optimize? Here constexpr is a blessing. Depending on whether you have support for C++11, C++14, C++17 or even the upcoming C++20, you'll get different feature sets of what you can do in a constexpr function. For example:
constexpr char c = std::string_view{"my_very_long_string"}[7];
With the code above, c is defined as a constexpr variable. Because you apply constexpr to the variable, you require several things:
Your compiler must evaluate the code so that the value of c is known at compile time. This holds true even for -O0 builds!
All functions used to calculate c must be constexpr and available (which, as a result, enforces the visibility requirement from the first question).
No undefined behaviour may be triggered in the calculation of c (for the given input values).
The downside of this is: your input needs to be known at compile time.
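On the plus side, the property can then be asserted directly in code, which acts as a compile-time unit test (a sketch; std::string_view requires C++17):
#include <string_view>
constexpr char c = std::string_view{"my_very_long_string"}[7];
// Fails to compile if c cannot be computed at compile time or has an
// unexpected value:
static_assert(c == '_', "evaluated at compile time");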
C++17 also provides if constexpr, which has similar requirements: the condition needs to be evaluated at compile time. The result is that the discarded branch is not instantiated at all (it can even contain code that does not work for the type you are using).
Which then brings us to the question: How do I ensure sufficient optimization for my program to run fast enough, even if my compiler isn't well behaved? Here the only relevant answer is: create benchmarks and compare the results. Take the effort to set up a CI job that automates this for you. (And yes, you can even use external hardware, although that is not easy.) In the end, you have some requirements: handling A should take less than X seconds. Do A several times and time it. Even if the benchmarks don't cover everything, as long as the results stay within the requirements, it's fine.
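A minimal sketch of such a timing check, assuming a hypothetical operation handleA() and a per-call time budget (on bare metal you would replace std::chrono with a hardware timer):
#include <chrono>
#include <cstdio>
void handleA(); // the operation under test (made-up name)
bool benchmarkHandleA(int iterations, double maxSecondsPerCall) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        handleA();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    double perCall = elapsed.count() / iterations;
    std::printf("handleA: %.9f s/call\n", perCall);
    return perCall <= maxSecondsPerCall; // let the CI job fail on false
}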
Note: As this is about debugging, you most likely can track the size of the executable as well. As soon as you start using streams, a lot of conversions to string and the like, your executable size will grow. (And you'll find this tracking a blessing, as you will immediately spot the commits which add 10% to the image size.)
And then the final question: you have a buggy compiler that doesn't meet your requirements. Here the only answer is: replace it. In the end, you can use any compiler to compile your code to bare metal, as long as the linker scripts work. If you need a start, C++Now 2018: Michael Caisse "Modern C++ in Embedded Systems" gives you a very good idea of what you need in order to use a different compiler. (Like a recent Clang or GCC, against which you can even file bugs if the optimization isn't good enough.)

Insert a reference to external data or a function into the block that should be verified to be optimised away. Like this:
extern void nop();
constexpr bool isDebugging = false; // somehow a global flag
void foo(int arg) {
    if( isDebugging ) {
        nop();
        std::cout << "Arg was " << arg << std::endl; // must not appear in production binary!
    }
    // normal code here...
}
In debug builds, link with an implementation of nop() in an extra compilation unit nop.cpp:
void nop() {}
In release builds, don't provide an implementation.
Release builds will only link if the optimisable code is eliminated.
- kisch

Here's another nice solution using inline assembly.
This uses assembler directives only, so it might even be kind of portable (checked with clang).
constexpr bool isDebugging = false; // somehow a global flag
void foo(int arg) {
    if( isDebugging ) {
        asm(".globl _marker\n_marker:\n");
        std::cout << "Arg was " << arg << std::endl; // must not appear in production binary!
    }
    // normal code here...
}
This would leave an exported linker symbol in the compiled executable if the code isn't optimised away. You can check for this symbol using nm(1).
clang can even stop the compilation right away:
constexpr bool isDebugging = false; // somehow a global flag
void foo(int arg) {
    if( isDebugging ) {
        asm("_marker=1\n");
        std::cout << "Arg was " << arg << std::endl; // must not appear in production binary!
    }
    asm volatile (
        ".ifdef _marker\n"
        ".err \"code not optimised away\"\n"
        ".endif\n"
    );
    // normal code here...
}

This is not an answer to "How to ensure some code is optimized away?" but to your summary line "Can a unit test be written that e.g. whole loops are optimized away?".
First, the answer depends on how broadly you define the scope of unit testing: if you include performance tests, you might have a chance.
If, in contrast, you understand unit testing as a way to test the functional behaviour of the code, you don't. For one thing, optimizations (if the compiler works correctly) must not change the behaviour of standard-conforming code.
With incorrect code (code that has undefined behaviour), optimizers can do what they want. (Well, for code with undefined behaviour the compiler may do so in the non-optimizing case as well, but sometimes only the deeper analyses performed during optimization make it possible for the compiler to detect that some code has undefined behaviour.) Thus, if you write unit tests for some piece of code with undefined behaviour, the test results may differ when you run them with and without optimization. But, strictly speaking, this only tells you that the compiler translated the code differently the two times; it does not guarantee that the code is optimized in the way you want it to be.

Here's a different approach that also covers the first example.
You can verify (at runtime) that the code has been eliminated by comparing two labels placed around it.
This relies on the GCC extension "Labels as Values" https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
before:
    for(size_t i = 0; i < elements; i++)
        buffer[i].~T();
behind:
    if (intptr_t(&&behind) != intptr_t(&&before)) abort();
It would be nice if you could check this in a static_assert(), but sadly the difference of &&label expressions is not accepted as a compile-time constant.
GCC insists on inserting a runtime comparison, even though both labels are in fact at the same address.
Interestingly, if you compare the addresses (type void*) directly, without casting them to intptr_t, GCC falsely optimises away the if() as "always true", whereas clang correctly optimises away the complete if() as "always false", even at -O1.

Related

c/c++ using else instead of if not

Encountered a bit of weird-looking code and it got me wondering if there's any practical application to it, or if it's just a random oddity.
The code essentially looks like this:
#ifdef PREPROCESSOR_CONDITION
if (runtime_condition) {
} else
#endif
{
    //expression
}
I included the macro bit, though I doubt it has any bearing. There's no code that runs when runtime_condition is true, only the else block. I figure this ought to be completely identical to using if(!runtime_condition) and no else block (which would have been more straightforward), but maybe there's some kind of compiler-optimizey thing happening?
Or, you know, it could be that there used to be something in the if block that got deleted and nobody bothered to change the expression.
The "macro bit" is significant.
Consider what happens if the programmer mistakenly changes the snippet to
#ifdef PREPROCESSOR_CONDITION
if (runtime_condition)
{
}
else
#endif
{
    //expression
}
else
{
    // another expression
}
This will result in a compilation error, regardless of whether PREPROCESSOR_CONDITION is defined or not.
With your change, viz.:
#ifdef PREPROCESSOR_CONDITION
if (!runtime_condition)
#endif
{
    //expression
}
else
{
    // another expression
}
it will compile if PREPROCESSOR_CONDITION is defined, but fail if it is not defined.
If the programmer who adds the else only attempts compilation in conditions with PREPROCESSOR_CONDITION defined, no problem will be found. There will then be a latent defect in the code, that will not be exposed until the code is compiled with PREPROCESSOR_CONDITION undefined.
This may seem minor in a single code snippet, but compilation errors occurring by surprise (e.g. code breaking in unexpected places) are a significant productivity concern in larger projects.
Ensuring the conditional definitely always has two branches (on those occasions when it is compiled as a conditional) guarantees that someone who works on a platform that always has PREPROCESSOR_CONDITION defined won't absent-mindedly add a real else block below, which clearly isn't how this block is supposed to work (or worse, "fix" it so that their code does compile everywhere, but in the process damage whatever the intent of the original author was in constructing their blocks that way).
If that is the intent, it would normally make sense to communicate it explicitly by hiding the exact details of the if-empty-else behind a macro named something like UNLESS, though.
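A sketch of such a macro (UNLESS is a made-up name here; the semantics match the snippet from the question):
#ifdef PREPROCESSOR_CONDITION
#define UNLESS(cond) if (cond) {} else
#else
#define UNLESS(cond)
#endif
// usage -- a stray 'else' after the block now fails to compile in both
// configurations, which is exactly the safety property described above:
UNLESS(runtime_condition)
{
    //expression
}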
It definitely isn't anything to do with optimization. If you think in terms of the jumps involved, a compiler already has to invert the value of the predicate to decide whether to skip the block or not (i.e. if (A) {B;} C:... really means if (!A) goto C; B; C:...). That means it folds a hand-written ! at the outermost level of the expression into the conditional's structure anyway. Since most instruction sets provide both jump-if-true and jump-if-false instructions, doing this is completely free, and both ways of writing "if not" in the source will produce the same machine code on even a very simple compiler without any optimization.

Do GCC/Clang allow to access static member through null pointer?

#include <iostream>
struct Foo { static auto foo() -> int { return 123; } };
int main() {
    std::cout << static_cast<Foo*>(nullptr)->foo() << std::endl;
    return 0;
}
I DO KNOW that this is not allowed by the standard. But what about specific compilers?
I only care about GCC (g++) and Clang.
Is there any guarantee that these two compilers allow this as a compiler feature/specification/extension?
I can find this neither in the list of GCC's C++ extensions nor in the corresponding list for Clang. So I would say that you have no guarantee that this works for either compiler.
There is nothing saying that this will work. It is up to you, if you decide to use it, to ascertain that it really does work on the actual architecture on which you are using it. And it's important to reflect on the fact that code generation, debug traps for null-pointer usage (and such) and optimisation in all compilers are "target system dependent", so it may well work on x86 but not on MIPS, as an example. Or it may stop working in one way or another in version X+0.0.1 of the compiler.
Having said that, I'd expect this particular example to work perfectly fine on any architecture, because the this pointer is never actually used: a static member function doesn't receive one.
Note that just because something is undefined, doesn't necessarily mean that it will crash, fail or even "not work exactly as you expect it to". It does, however, allow the compiler to generate whatever code the compiler would like, including something that blows up your computer, formats your hard-drive, etc, etc.
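If all you need is the call itself, the well-defined spelling sidesteps the question entirely (a sketch of the obvious alternative):
#include <iostream>
struct Foo { static auto foo() -> int { return 123; } };
int main() {
    // A static member function can be called without any object or pointer:
    std::cout << Foo::foo() << std::endl;
}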

C++ use templates to avoid compiler from checking a boolean

Let's say I have a function:
template <bool stuff>
inline void doSomething() {
    if(stuff) {
        cout << "Hello" << endl;
    }
    else {
        cout << "Goodbye" << endl;
    }
}
And I call it like this:
doSomething<true>();
doSomething<false>();
It would print out:
Hello
Goodbye
What I'm really wondering is does the compiler fully optimize this?
When I call the templated function with true, will it create a function that just outputs "Hello" and avoids the if statement and the code for "Goodbye"?
This would be really useful for this one giant function I just wrote that's supposed to be very optimized and avoid as many unnecessary if statement checks as possible. I have a very good feeling it would, at least in a release build with optimizations if not in a debug build with no optimizations.
Disclaimer: No one can guarantee anything.
That said, this is an obvious and easy optimization for any compiler. It's quite safe to say that it will be optimized away, unless the optimizer is, well, practically useless.
Since your "true" and "false" are constants, you are unambiguously creating an obvious dead branch in each instantiation, and the compiler should optimize it away. "Should" is meant literally here: I would consider it a major, major problem if an "optimizing" compiler did not perform dead-branch removal.
In other words, if your compiler cannot optimize this, it is the use of that compiler that should be evaluated, not the code.
So, I would say your gut feeling is correct: while yes, no "guarantees" as such can be made for each and every compiler, I would not use a compiler incapable of performing simplistic optimizations in any production environment, and certainly not in any performance-critical one. (In release builds, of course.)
So, use it. Any modern optimizing compiler will optimize it away because it is a trivial optimization. If in doubt, check disassembly, and if it is not optimized, change the compiler to something more modern.
In general, if you are writing any kind of performance-critical code, you must rely, at least to some extent, on compiler optimizations.
This is inherently up to the compiler, so you'd have to check the compiler's documentation or the generated code. But in simple cases like this, you can easily implement the optimization yourself:
template <bool stuff>
inline void doSomething();
template<>
inline void doSomething<true>() {
    cout << "Hello" << endl;
}
template<>
inline void doSomething<false>() {
    cout << "Goodbye" << endl;
}
But "optimization" isn't really the right word to use since this might actually degrade performance. It's only an optimization if it actually benefits your code performance.
Indeed, it really creates two functions, but
premature optimization is the root of all evil
especially if you're changing your code structure because of a simple if statement. I doubt that this will affect performance. Also, the boolean must be a compile-time constant; that means you can't take a runtime-evaluated variable and pass it to the function. How should the linker know which function to call? In that case you'll have to evaluate it manually and call the appropriate function on your own, as sketched below.
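A sketch of that manual dispatch, reusing doSomething from the question (the wrapper name is made up):
// A runtime flag cannot be used as a template argument, so evaluate it
// once and branch to the matching instantiation by hand:
void doSomethingRuntime(bool stuff) {
    if (stuff)
        doSomething<true>();
    else
        doSomething<false>();
}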
Compilers are really good at constant folding. That is, in this case it would surprise me if the check survived optimization. A non-optimized build might still contain the check. The easiest way to verify is to generate assembler output and inspect it.
That said, it is worth noting that the compiler has to check both branches for correctness, even if it only ever uses one of them. This frequently shows up, e.g., when using slightly different algorithms for random access iterators and other iterators: the condition would depend on a type trait, and one of the branches may fail to compile depending on the operations tested for by the traits. The committee has discussed turning off this checking under the term static if, although there is no consensus yet on how the feature would look exactly (if it gets added).
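A sketch of the classic workaround: tag dispatch splits the branches into separate overloads, so each branch only has to compile for the iterator categories that actually reach it (advanceFast is a made-up name):
#include <iterator>
template <typename It>
void advanceFast(It& it, int n, std::random_access_iterator_tag) {
    it += n;                  // only valid for random access iterators
}
template <typename It>
void advanceFast(It& it, int n, std::input_iterator_tag) {
    while (n-- > 0) ++it;     // fallback for weaker iterators
}
template <typename It>
void advanceFast(It& it, int n) {
    // The tag object selects the right overload at compile time:
    advanceFast(it, n, typename std::iterator_traits<It>::iterator_category{});
}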
If I understand you correctly, you want to (in essence) end up with 'two' functions that are each optimised for a true or a false input, so that they don't need to check that flag?
Aside from any trivial optimisation that may yield (I'm against premature optimisation; I believe in maintainability before measurement before optimisation), I would say: why not refactor your function to actually be two functions? If they have common code, then that code could be refactored out too. However, if the requirements are such that this refactoring is non-optimal, then I'd fall back to a #define-based approach.

Compiler instruction reordering optimizations in C++ (and what inhibits them)

I've reduced my code down to the following, which is as simple as I could make it whilst retaining the compiler output that interests me.
void foo(const uint64_t used)
{
    uint64_t ar[100];
    for(int i = 0; i < 100; ++i)
    {
        ar[i] = some_global_array[i];
    }
    const uint64_t mask = ar[0];
    if((used & mask) != 0)
    {
        return;
    }
    bar(ar); // Not inlined
}
Using VC10 with /O2 and /Ob1, the generated assembly pretty much reflects the order of instructions in the above C++ code. Since the local array ar is only passed to bar() when the condition fails, and is otherwise unused, I would have expected the compiler to optimize to something like the following.
if((used & some_global_array[0]) != 0)
{
    return;
}
// Now do the copying to ar and call bar(ar)...
Is the compiler not doing this because it's simply too hard for it to identify such optimizations in the general case? Or is it following some strict rule that forbids it from doing so? If so, why, and is there some way I can give it a hint that doing so wouldn't change the semantics of my program?
Note: obviously it would be trivial to obtain the optimized output by just rearranging the code, but I'm interested in why the compiler won't optimize in such cases, not how to do so in this (intentionally simplified) case.
Probably the reason why this is not getting optimized is the global array. The compiler can't know beforehand whether, say, accessing some_global_array[99] will result in some kind of exception or signal being generated, so it has to execute the whole loop. Things would be pretty different if the global array were statically defined in the same compilation unit.
For example, in LLVM, the following three definitions of the global array will yield wildly differing outputs of that function:
// this yields pretty much what you're seeing
uint64_t *some_global_array;

// this calls memcpy and then performs the conditional check
uint64_t some_global_array[100] = {0};

// this calls memset (not memcpy!) on the ar array and then bar directly (no
// conditional checks, since the array is const and filled with 0s, so the if
// is always false)
const uint64_t some_global_array[100] = {0};
The second is pretty puzzling, but it may simply be a missed optimization (or maybe I'm missing something else).
There are no "strict rules" controlling what kind of assembly language the compiler is permitted to output. If the compiler can be certain that a block of code does not need to be executed (because it has no side effects) due to some precondition, then it is absolutely permitted to short-circuit the whole thing.
This sort of optimisation can be fairly complex in the general case, and your compiler might not go to all that effort. If this is performance-critical code, then you can fine-tune your source code (as you suggest) to help the compiler generate the best assembly. This is a trial-and-error process, though, and you might have to do it again for the next version of the compiler.

Why is linker optimization so poor?

Recently, a coworker pointed out to me that compiling everything into a single file created much more efficient code than compiling separate object files - even with link time optimization turned on. In addition, the total compile time for the project went down significantly. Given that one of the primary reasons for using C++ is code efficiency, this was surprising to me.
Clearly, when the archiver/linker makes a library out of object files, or links them into an executable, even simple optimizations are penalized. In the example below, trivial inlining costs 1.8% in performance when done by the linker instead of the compiler. It seems like compiler technology should be advanced enough to handle fairly common situations like this, but it isn't happening.
Here is a simple example using Visual Studio 2008:
#include <cstdlib>
#include <iostream>
#include <boost/timer.hpp>
using namespace std;
int foo(int x);
int foo2(int x) { return x++; }
int main(int argc, char** argv)
{
    boost::timer t;
    t.restart();
    for (int i=0; i<atoi(argv[1]); i++)
        foo (i);
    cout << "time : " << t.elapsed() << endl;
    t.restart();
    for (int i=0; i<atoi(argv[1]); i++)
        foo2 (i);
    cout << "time : " << t.elapsed() << endl;
}
foo.cpp
int foo (int x) { return x++; }
Results of the run: a 1.8% performance hit for using the linked foo instead of the inline foo2.
$ ./release/testlink.exe 100000000
time : 13.375
time : 13.14
And yes, the linker optimization flags (/LTCG) are on.
Your coworker is out of date. The technology has been here since 2003 (on the MS C++ compiler): /LTCG. Link-time code generation deals with exactly this problem. From what I know, GCC has this feature on the radar for the next-generation compiler.
LTCG does not only optimize the code (e.g. inlining functions across modules) but actually rearranges code to optimize cache locality and branching for a specific load; see Profile-Guided Optimizations. These options are usually reserved for release builds, as the build can take hours to finish: it will link an instrumented executable, run a profiling load, and then link again with the profiling results. The link contains details about what exactly is optimized with LTCG:
Inlining – For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.
Virtual Call Speculation – If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.
Register Allocation – Optimizing with profile data results in better register allocation.
Basic Block Optimization – Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.
Size/Speed Optimization – Functions where the program spends a lot of time can be optimized for speed.
Function Layout – Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.
Conditional Branch Optimization – With the value probes, profile-guided optimizations can find if a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions, where the optimizer can order the if/else so that either the if or else block is placed first depending on which block is more frequently true.
Dead Code Separation – Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.
EH Code Separation – The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only on exceptional conditions.
Memory Intrinsics – The expansion of intrinsics can be decided better if it can be determined if an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.
I'm not a compiler specialist, but I think the compiler has much more information at its disposal to optimize with, as it operates on a language tree, as opposed to the linker, which has to content itself with operating on the object output, far less expressive than the code the compiler has seen. Hence less effort is spent by the linker and compiler development teams on making linker optimizations that could match, in theory, the tricks the compiler does.
BTW, I'm sorry I diverted your original question into the LTCG discussion. I now understand your question was a little bit different, more concerned with link-time vs. compile-time static optimizations possible/available.
Your coworker is smarter than most of us. Even if it seems a crude approach at first, inlining the project into a single .cpp file has one thing that other approaches like link-time optimization do not have and will not have for a while: reliability.
However, you asked this two years ago, and I testify that a lot has changed since then (with g++ at least). Devirtualization is a lot more reliable, for instance.