C++: using templates to avoid the compiler checking a boolean

Let's say I have a function:
template <bool stuff>
inline void doSomething() {
    if (stuff) {
        cout << "Hello" << endl;
    }
    else {
        cout << "Goodbye" << endl;
    }
}
And I call it like this:
doSomething<true>();
doSomething<false>();
It would print out:
Hello
Goodbye
What I'm really wondering is does the compiler fully optimize this?
When I call the templated function with true, will it create a function that just outputs "Hello" and avoids the if statement and the code for "Goodbye"?
This would be really useful for one giant function I just wrote that is supposed to be heavily optimized and avoid as many unnecessary if checks as possible. I have a very good feeling it would be optimized, at least in a release build with optimizations, if not in a debug build without them.

Disclaimer: No one can guarantee anything.
That said, this is an obvious and easy optimization for any compiler. It is quite safe to say that it will be optimized away, unless the optimizer is, well, practically useless.
Since your "true" and "false" are compile-time constants, you are unambiguously creating an obvious dead branch in each instantiation, and the compiler should optimize it away. "Should" is meant literally here: I would consider it a major, major problem if an "optimising" compiler did not perform dead-branch removal.
In other words, if your compiler cannot optimize this, it is the use of that compiler that should be evaluated, not the code.
So I would say your gut feeling is correct: while no "guarantees" as such can be made for each and every compiler, I would not use one incapable of such simplistic optimizations in any production environment, and certainly not in a performance-critical one (in release builds, of course).
So, use it. Any modern optimizing compiler will optimize it away because it is a trivial optimization. If in doubt, check disassembly, and if it is not optimized, change the compiler to something more modern.
In general, if you are writing any kind of performance-critical code, you must rely, at least to some extent, on compiler optimizations.

This is inherently up to the compiler, so you'd have to check the compiler's documentation or the generated code. But in simple cases like this, you can easily implement the optimization yourself:
template <bool stuff>
inline void doSomething();

template <>
inline void doSomething<true>() {
    cout << "Hello" << endl;
}

template <>
inline void doSomething<false>() {
    cout << "Goodbye" << endl;
}
But "optimization" isn't really the right word to use since this might actually degrade performance. It's only an optimization if it actually benefits your code performance.

Indeed, it really creates two functions, but
premature optimization is the root of all evil
especially if you're changing your code structure because of a simple if statement. I doubt that this will affect performance. Also, the boolean must be a compile-time constant; that means you can't take a runtime-evaluated variable and pass it to the function. How would the compiler know which instantiation to call? In that case you'll have to evaluate the condition yourself and call the appropriate function on your own.

Compilers are really good at constant folding. That is, in this case it would surprise me if the check survived optimization. A non-optimized build might still contain the check. The easiest way to verify is to generate assembler output and check.
That said, it is worth noting that the compiler has to check both branches for correctness, even if it only ever uses one branch. This frequently shows up, e.g., when using slightly different algorithms for Random Access Iterators and other iterators. The condition would depend on a type-trait, and one of the branches may fail to compile depending on the operations tested for by the traits. The committee has discussed turning off this checking under the term static if, although there is no consensus yet on how the feature would look exactly (if it gets added).

If I understand you correctly, you want (in essence) to end up with 'two' functions that are optimised for either a true or a false input, so that they don't need to check that flag?
Aside from any trivial optimisation that may yield (I'm against premature optimisation: I believe in maintainability before measurement before optimisation), I would say: why not refactor your function to actually be two functions? If they have common code, then that code could be refactored out too. However, if the requirement is such that this refactoring is non-optimal, I'd replace it with a #define-based refactoring instead.

Related

How to ensure some code is optimized away?

tl;dr: Can it be ensured somehow (e.g. by writing a unit test) that some things are optimized away, e.g. whole loops?
The usual approach to be sure that something is not included in the production build is wrapping it with #if...#endif. But I prefer to stay with C++ mechanics instead. Even there, instead of complicated template specializations I like to keep implementations simple and argue "hey, the compiler will optimize this out anyway".
Context is embedded SW in automotive (binary size matters) with often poor compilers. They are certified in the sense of safety, but usually not good at optimizations.
Example 1: In a container the destruction of elements is typically a loop:
for (size_t i = 0; i < elements; i++)
    buffer[i].~T();
This also works for built-in types such as int, since the standard allows the explicit call of the destructor even for scalar types (C++11, 12.4-15). In such a case the loop does nothing and should be optimized out. In GCC it is, but in another compiler (for Aurix) it was not: I saw a literally empty loop in the disassembly! So that needed a template specialization to fix it.
Example 2: Code, which is intended for debugging, profiling or fault-injection etc. only:
constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if (isDebugging) {
        // Albeit a 'dead' section, it may not appear in the production binary!
        // (size, security, safety...)
        // 'if constexpr...' is not an option (C++11)
        std::cout << "Arg was " << arg << std::endl;
    }
    // normal code here...
}
I can look at the disassembly, sure. But being upstream platform software, it's hard to control all the targets, compilers and options a downstream project might use. The fear is that, for whatever reason, a downstream project ends up with code bloat or a performance issue.
Bottom line: Is it possible to write the software in a way that certain code is known to be optimized away, as safely as a #if would do? Or can a unit test be written that fails if the optimization is not as expected?
[Timing tests come to my mind for the first problem, but being bare-metal I don't have convenient tools yet.]
There may be a more elegant way, and it's not a unit test, but if you're just looking for that particular string, and you can make it unique,
strings $COMPILED_BINARY | grep "Arg was"
should show you whether the string was included in the binary.
if constexpr is the canonical C++ expression (since C++17) for this kind of test.
constexpr bool DEBUG = /*...*/;

int main() {
    if constexpr (DEBUG) {
        std::cerr << "We are in debugging mode!" << std::endl;
    }
}
If DEBUG is false, then the code to print to the console won't be generated at all. So if you have things like log statements that you need for checking the behavior of your code, but which you don't want in production code, you can hide them inside if constexpr blocks to eliminate the code entirely once it moves to production.
Looking at your question, I see several (sub-)questions in it that require an answer. Not all answers might be possible with your bare-metal compilers as hardware vendors don't care that much about C++.
The first question is: How do I write code in a way that I'm sure it gets optimized. The obvious answer here is to put everything in a single compilation unit so the caller can see the implementation.
The second question is: How can I force a compiler to optimize? Here constexpr is a blessing. Depending on whether you have support for C++11, C++14, C++17 or even the upcoming C++20, you'll get different feature sets of what you can do in a constexpr function. For the usage:
constexpr char c = std::string_view{"my_very_long_string"}[7];
With the code above, c is defined as a constexpr variable. By applying constexpr to the variable, you require several things:
The compiler must evaluate the value of c at compile time. This holds true even for -O0 builds!
All functions used to calculate c must be constexpr and available (which, as a result, enforces the behaviour from the first question).
No undefined behaviour is allowed to be triggered in the calculation of c (for the given value).
The negative about this is: Your input needs to be known at compile time.
C++17 also provides if constexpr, which has similar requirements: the condition needs to be calculated at compile time. The result is that one branch of the code is not compiled at all (it may even contain constructs that don't work on the type you are using).
Which then brings us to the question: How do I ensure sufficient optimizations for my program to run fast enough, even if my compiler isn't well-behaved? Here the only relevant answer is: create benchmarks and compare the results. Take the effort to set up a CI job that automates this for you. (And yes, you can even use external hardware, although that is not that easy.) In the end, you have some requirements: handling A should take less than X seconds. Do A several times and time it. Even if the benchmarks don't cover everything, as long as the timings are within the requirements, it's fine.
Note: As this is about debug code, you most likely can track the size of the executable as well. As soon as you start using streams, a lot of conversions to string... your exe size will grow. (And you'll find it a blessing, as you will immediately spot commits which add 10% to the image size.)
And then the final question: You have a buggy compiler that doesn't meet your requirements. Here the only answer is: replace it. In the end, you can use any compiler to compile your code to bare metal, as long as the linker scripts work. If you need a start, C++Now 2018: Michael Caisse "Modern C++ in Embedded Systems" gives you a very good idea of what you need to use a different compiler. (Like a recent Clang or GCC, against which you can even log bugs if the optimization isn't good enough.)
Insert a reference to external data or function into the block that should be verified to be optimised away. Like this:
extern void nop();

constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if (isDebugging) {
        nop();
        std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
    }
    // normal code here...
}
In Debug-Builds, link with an implementation of nop() in an extra compilation unit nop.cpp:
void nop() {}
In Release-Builds, don't provide an implementation.
Release builds will only link if the optimisable code is eliminated.
- kisch
Here's another nice solution using inline assembly.
This uses assembler directives only, so it might even be kind of portable (checked with clang).
constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if (isDebugging) {
        asm(".globl _marker\n_marker:\n");
        std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
    }
    // normal code here...
}
This would leave an exported linker symbol in the compiled executable, if the code isn't optimised away. You can check for this symbol using nm(1).
clang can even stop the compilation right away:
constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if (isDebugging) {
        asm("_marker=1\n");
        std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
    }
    asm volatile (
        ".ifdef _marker\n"
        ".err \"code not optimised away\"\n"
        ".endif\n"
    );
    // normal code here...
}
This is not an answer to "How to ensure some code is optimized away?" but to your summary line "Can a unit test be written that e.g. whole loops are optimized away?".
First, the answer depends on how far you see the scope of unit-testing - so if you put in performance tests, you might have a chance.
If in contrast you understand unit-testing as a way to test the functional behaviour of the code, you don't. For one thing, optimizations (if the compiler works correctly) shall not change the behaviour of standard-conforming code.
With incorrect code (code that has undefined behaviour) optimizers can do what they want. (Well, for code with undefined behaviour the compiler can do so in the non-optimizing case too, but sometimes only the deeper analyses performed during optimization make it possible for the compiler to detect that some code has undefined behaviour.) Thus, if you write unit tests for some piece of code with undefined behaviour, the test results may differ when you run the tests with and without optimization. But, strictly speaking, this only tells you that the compiler translated the code differently the two times; it does not guarantee that the code is optimized in the way you want it to be.
Here's another different way that also covers the first example.
You can verify (at runtime) that the code has been eliminated, by comparing two labels placed around it.
This relies on the GCC extension "Labels as Values" https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
before:
    for (size_t i = 0; i < elements; i++)
        buffer[i].~T();
behind:
    if (intptr_t(&&behind) != intptr_t(&&before)) abort();
It would be nice if you could check this in a static_assert(), but sadly the difference of &&label expressions is not accepted as compile-time constant.
GCC insists on inserting a runtime comparison, even though both labels are in fact at the same address.
Interestingly, if you compare the addresses (type void*) directly, without casting them to intptr_t, GCC falsely optimises away the if() as "always true", whereas clang correctly optimises away the complete if() as "always false", even at -O1.

Micro optimization - compiler optimization when accessing recursive members

I'm interested in writing good code from the beginning instead of optimizing it later. Sorry for not providing a benchmark; I don't have a working scenario at the moment. Thanks for your attention!
What are the performance gains of using FunctionY over FunctionX?
There is a lot of discussion on stackoverflow about this already but I'm in doubts in the case when accessing sub-members (recursive) as shown below. Will the compiler (say VS2008) optimize FunctionX into something like FunctionY?
void FunctionX(Obj* pObj)
{
    pObj->MemberQ->MemberW->MemberA.function1();
    pObj->MemberQ->MemberW->MemberA.function2();
    pObj->MemberQ->MemberW->MemberB.function1();
    pObj->MemberQ->MemberW->MemberB.function2();
    ...
    pObj->MemberQ->MemberW->MemberZ.function1();
    pObj->MemberQ->MemberW->MemberZ.function2();
}

void FunctionY(Obj* pObj)
{
    W* localPtr = pObj->MemberQ->MemberW;
    localPtr->MemberA.function1();
    localPtr->MemberA.function2();
    localPtr->MemberB.function1();
    localPtr->MemberB.function2();
    ...
    localPtr->MemberZ.function1();
    localPtr->MemberZ.function2();
}
As long as none of the member pointers are volatile or pointers to volatile, and you don't have operator-> overloaded for any member in the chain, both functions are the same.
The optimization you suggest is widely known as Common Subexpression Elimination and has been supported by the vast majority of compilers for many decades.
In theory, you save on the extra pointer dereferences, HOWEVER, in the real world, the compiler will probably optimize it out for you, so it's a useless optimization.
This is why it's important to profile first, and then optimize later. The compiler is doing everything it can to help you, you might as well make sure you're not just doing something it's already doing.
If the compiler is good enough, it should translate FunctionX into something similar to FunctionY.
But you can get different results on different compilers, and on the same compiler with different optimization flags.
With a "dumb" compiler, FunctionY should be faster, and IMHO it is also more readable and faster to write. So stick with FunctionY.
P.S. You should take a look at a code style guide; normally, member and function names start with a lower-case letter.

c++ optimization

I'm working on some existing c++ code that appears to be written poorly, and is very frequently called. I'm wondering if I should spend time changing it, or if the compiler is already optimizing the problem away.
I'm using Visual Studio 2008.
Here is an example:
void someDrawingFunction(....)
{
    GetContext().DrawSomething(...);
    GetContext().DrawSomething(...);
    GetContext().DrawSomething(...);
    ...
}
Here is how I would do it:
void someDrawingFunction(....)
{
    MyContext& c = GetContext();
    c.DrawSomething(...);
    c.DrawSomething(...);
    c.DrawSomething(...);
    ...
}
Don't guess at where your program is spending time. Profile first to find your bottlenecks, then optimize those.
As for GetContext(), that depends on how complex it is. If it's just returning a class member variable, then chances are that the compiler will inline it. If GetContext() has to perform a more complicated operation (such as looking up the context in a table), the compiler probably isn't inlining it, and you may wish to only call it once, as in your second snippet.
If you're using GCC, you can also tag the GetContext() function with the pure attribute. This will allow it to perform more optimizations, such as common subexpression elimination.
If you're sure it's a performance problem, change it. If GetContext is a function call (as opposed to a macro or an inline function), then the compiler is going to HAVE to call it every time, because the compiler can't necessarily see what it's doing, and thus, the compiler probably won't know that it can eliminate the call.
Of course, you'll need to make sure that GetContext ALWAYS returns the same thing, and that this 'optimization' is safe.
If it is logically correct to do it the second way, i.e. if calling GetContext() once instead of multiple times does not affect your program logic, I'd do it the second way, even if profiling proves there is no performance difference either way, so that the next developer looking at this code will not ask the same question again.
Obviously, if GetContext() has side effects (I/O, updating globals, etc.) then the suggested optimization will produce different results.
So unless the compiler can somehow detect that GetContext() is pure, you should optimize it yourself.
If you're wondering what the compiler does, look at the assembly code.
That is such a simple change, I would do it.
It is quicker to fix it than to debate it.
But do you actually have a problem?
Just because it's called often doesn't mean it's called TOO often.
If it seems qualitatively piggy, sample it to see what it's spending time at.
Chances are excellent that it is not what you would have guessed.

Is putting a function within an if statement efficient? (C++)

I've seen statements like this
if (SomeBoolReturningFunc())
{
    //do some stuff
    //do some more stuff
}
and am wondering if putting a function in an if statement is efficient, or if there are cases when it would be better to leave them separate, like this
bool AwesomeResult = SomeBoolReturningFunc();
if (AwesomeResult)
{
    //do some other, more important stuff
}
...?
I'm not sure what makes you think that assigning the result of the expression to a variable first would be more efficient than evaluating the expression itself, but it's never going to matter, so choose the option that enhances the readability of your code. If you really want to know, look at the output of your compiler and see if there is any difference. On the vast majority of systems out there this will likely result in identical machine code.
Either way it should not matter. The basic idea is that the result will be stored in a temporary no matter what, whether you name it or not. Readability is more important nowadays, because computers are normally so fast that small tweaks like this don't matter as much.
I've certainly seen if (f()) {blah;} produce more efficient code than bool r = f(); if (r) {blah;}, but that was many years ago on a 68000.
These days, I'd definitely pick the code that's simpler to debug. Your inefficiencies are far more likely to be your algorithm vs. the code your compiler generates.
As others have said, it makes basically no real difference in performance. In C++, most performance gains at the pure code level (as opposed to the algorithm level) are made by techniques such as loop unrolling.
Avoiding branching altogether can yield far more performance if the condition is inside a loop: for instance, by having separate loops for each conditional case, or by using statements that inherently take the condition into account (perhaps by multiplying a term by 0 if it's not wanted). You can then gain more by unrolling that loop.
Templating your code can help a lot with this in a "semi" clean way.
There is also the possibility of
if (bool AwesomeResult = SomeBoolRetuningFunc()) { ... }
:)
IMO, the kind of statements you have seen are clearer to read, and less error prone:
bool result = Foo();
if (result) {
    //some stuff
}

bool other_result = Bar();
if (result) { //should be other_result? Hopefully caught by "unused variable" warning...
    //more stuff
}
Both variants usually produce the same machine code and run exactly the same. Very rarely will there be a performance difference, and even in those cases it is unlikely to be a bottleneck (which translates into: don't bother prior to profiling).
The significant difference is in debugging and readability. With a temporary variable it's easier to debug. Without the variable the code is shorter and perhaps more readable.
If you want both easy to debug and easier to read code you should better declare the variable as const:
const bool AwesomeResult = SomeBoolReturningFunc();
if (AwesomeResult)
{
    //do some other, more important stuff
}
this way it's clearer that the variable is never assigned to again and there's no other logic behind its declaration.
Putting aside any debugging ease or readability problems, and as long as the function's returned value is not used again in the if-block, it seems to me that assigning the returned value to a variable only adds an extra use of the = operator and an extra bool stored on the stack. I might speculate further that an extra variable on the stack could cause latency in further stack accesses (though I'm not sure).
The thing is, these are really minor problems and as long as the compiler has an optimization flag on, should cause no inefficiency. A different case I'd consider would be an embedded system - then again, how much damage could a single, 8-bit variable cause? (I have absolutely no knowledge concerning embedded systems, so maybe someone else could elaborate on this?)
The AwesomeResult version can be faster if SomeBoolReturningFunc() is fairly slow and you are able to use AwesomeResult more than once rather than calling SomeBoolReturningFunc() again.
Placing the function inside or outside the if-statement doesn't matter. There is no performance gain or loss. This is because the compiler will automatically create a place on the stack for the return value - whether or not you've explicitly defined a variable.
In the performance tuning I've done, this would only be thought about in the final stage of cycle-shaving after cleaning out a series of significant performance issues.
I seem to recall that one statement per line was the recommendation of the book Code Complete, where it was argued that such code is easier to understand. Make each statement do one and only one thing so that it is very easy to see very quickly at a glance what is happening in the code.
I personally like having the return types in variables to make them easier to inspect (or even change) in the debugger.
One answer stated that a difference was observed in generated code. I sincerely doubt that with an optimizing compiler.
A disadvantage, prior to C++11, for multiple lines is that you have to know the type of the return for the variable declaration. For example, if the return is changed from bool to int then depending on the types involved you could have a truncated value in the local variable (which could make the if malfunction). If compiling with C++11 enabled, this can be dealt with by using the auto keyword, as in:
auto AwesomeResult = SomeBoolReturningFunc();
if (AwesomeResult)
{
    //do some other, more important stuff
}
Stroustrup's C++ 4th edition recommends putting the variable declaration in the if statement itself, as in:
if (auto AwesomeResult = SomeBoolReturningFunc())
{
    //do some other, more important stuff
}
His argument is that this limits the scope of the variable to the greatest extent possible. Whether or not that is more readable (or debuggable) is a judgement call.

When should I use __forceinline instead of inline?

Visual Studio includes support for __forceinline. The Microsoft Visual Studio 2005 documentation states:
The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer instead.
This raises the question: When is the compiler's cost/benefit analysis wrong? And, how am I supposed to know that it's wrong?
In what scenario is it assumed that I know better than my compiler on this issue?
You know better than the compiler only when your profiling data tells you so.
The one place I am using it is licence verification.
One important factor to protect against easy* cracking is to verify being licenced in multiple places rather than only one, and you don't want these places to be the same function call.
*) Please don't turn this in a discussion that everything can be cracked - I know. Also, this alone does not help much.
The compiler is making its decisions based on static code analysis, whereas if you profile as don says, you are carrying out a dynamic analysis that can be much farther reaching. The number of calls to a specific piece of code is often largely determined by the context in which it is used, e.g. the data. Profiling a typical set of use cases will do this. Personally, I gather this information by enabling profiling on my automated regression tests. In addition to forcing inlines, I have unrolled loops and carried out other manual optimizations on the basis of such data, to good effect. It is also imperative to profile again afterwards, as sometimes your best efforts can actually lead to decreased performance. Again, automation makes this a lot less painful.
More often than not, though, in my experience, tweaking algorithms gives much better results than straight code optimization.
I've developed software for limited resource devices for 9 years or so and the only time I've ever seen the need to use __forceinline was in a tight loop where a camera driver needed to copy pixel data from a capture buffer to the device screen. There we could clearly see that the cost of a specific function call really hogged the overlay drawing performance.
The only way to be sure is to measure performance with and without. Unless you are writing highly performance critical code, this will usually be unnecessary.
For SIMD code.
SIMD code often uses constants/magic numbers. In a regular function, every const __m128 c = _mm_setr_ps(1,2,3,4); becomes a memory reference.
With __forceinline, the compiler can load it once and reuse the value, unless your code exhausts the registers (usually 16).
CPU caches are great but registers are still faster.
P.S. Just got 12% performance improvement by __forceinline alone.
The inline directive is of no use at all for functions which are:
recursive,
long,
composed of loops.
If you want to force the decision anyway, use __forceinline.
Actually, even with the __forceinline keyword, Visual C++ sometimes chooses not to inline the code. (Source: the resulting assembly source code.)
Always look at the resulting assembly code where speed is of importance (such as tight inner loops needed to be run on each frame).
Sometimes using #define instead of inline will do the trick. (Of course you lose a lot of checking by using #define, so use it only when and where it really matters.)
Actually, boost is loaded with it.
For example
BOOST_CONTAINER_FORCEINLINE flat_tree& operator=(BOOST_RV_REF(flat_tree) x)
BOOST_NOEXCEPT_IF( (allocator_traits_type::propagate_on_container_move_assignment::value ||
allocator_traits_type::is_always_equal::value) &&
boost::container::container_detail::is_nothrow_move_assignable<Compare>::value)
{ m_data = boost::move(x.m_data); return *this; }
BOOST_CONTAINER_FORCEINLINE const value_compare &priv_value_comp() const
{ return static_cast<const value_compare &>(this->m_data); }
BOOST_CONTAINER_FORCEINLINE value_compare &priv_value_comp()
{ return static_cast<value_compare &>(this->m_data); }
BOOST_CONTAINER_FORCEINLINE const key_compare &priv_key_comp() const
{ return this->priv_value_comp().get_comp(); }
BOOST_CONTAINER_FORCEINLINE key_compare &priv_key_comp()
{ return this->priv_value_comp().get_comp(); }
public:
// accessors:
BOOST_CONTAINER_FORCEINLINE Compare key_comp() const
{ return this->m_data.get_comp(); }
BOOST_CONTAINER_FORCEINLINE value_compare value_comp() const
{ return this->m_data; }
BOOST_CONTAINER_FORCEINLINE allocator_type get_allocator() const
{ return this->m_data.m_vect.get_allocator(); }
BOOST_CONTAINER_FORCEINLINE const stored_allocator_type &get_stored_allocator() const
{ return this->m_data.m_vect.get_stored_allocator(); }
BOOST_CONTAINER_FORCEINLINE stored_allocator_type &get_stored_allocator()
{ return this->m_data.m_vect.get_stored_allocator(); }
BOOST_CONTAINER_FORCEINLINE iterator begin()
{ return this->m_data.m_vect.begin(); }
BOOST_CONTAINER_FORCEINLINE const_iterator begin() const
{ return this->cbegin(); }
BOOST_CONTAINER_FORCEINLINE const_iterator cbegin() const
{ return this->m_data.m_vect.begin(); }
There are several situations where the compiler is not able to determine categorically whether it is appropriate or beneficial to inline a function. Inlining may involve trade-offs that the compiler is unwilling to make, but you are (e.g., code bloat).
In general, modern compilers are actually pretty good at making this decision.
When you know that the function is going to be called in one place several times for a complicated calculation, then it is a good idea to use __forceinline. For instance, a matrix multiplication for animation may need to be called so many times that the calls to the function will start to be noticed by your profiler. As said by the others, the compiler can't really know about that, especially in a dynamic situation where the execution of the code is unknown at compile time.
A Case for noinline
I wanted to pitch in with an unusual suggestion and actually vouch for __declspec(noinline) in MSVC or the noinline attribute/pragma in GCC and ICC as an alternative to try out first, before __forceinline and its equivalents, when staring at profiler hotspots. YMMV, but I've gotten so much more mileage (measured improvements) out of telling the compiler what never to inline than what to always inline. It also tends to be far less invasive and can produce much more predictable and understandable hotspots when profiling the changes.
While it might seem very counter-intuitive and somewhat backward to try to improve performance by telling the compiler what not to inline, I'd claim based on my experience that it's much more harmonious with how optimizing compilers work and far less invasive to their code generation. A detail to keep in mind that's easy to forget is this:
Inlining a callee can often cause the caller, or the caller of the caller, to cease to be inlined.
This is what makes force inlining a rather invasive change to the code generation that can have chaotic results on your profiling sessions. I've even had cases where force inlining a function reused in several places completely reshuffled all top ten hotspots with the highest self-samples all over the place in very confusing ways. Sometimes it got to the point where I felt like I'm fighting with the optimizer making one thing faster here only to exchange a slowdown elsewhere in an equally common use case, especially in tricky cases for optimizers like bytecode interpretation. I've found noinline approaches so much easier to use successfully to eradicate a hotspot without exchanging one for another elsewhere.
It would be possible to inline functions much less invasively if we could inline at the call site instead of deciding whether every single call to a function should be inlined. Unfortunately, I've not found many compilers supporting such a feature besides ICC. If we are reacting to a hotspot, it makes much more sense to me to respond by inlining at the call site instead of making every single call of a particular function forcefully inlined. Lacking wide support for this among most compilers, I've gotten far more successful results with noinline.
Optimizing With noinline
Optimizing with noinline still has the same goal in mind: to help the optimizer inline our most critical functions. The difference is that instead of trying to tell the compiler what they are by forcefully inlining them, we do the opposite and tell the compiler which functions definitely aren't part of the critical execution path by forcefully preventing them from being inlined. We focus on identifying the rare-case, non-critical paths while leaving the compiler free to determine what to inline in the critical paths.
Say you have a loop that executes for a million iterations, and there is a function called baz that is called only very rarely in that loop, once every few thousand iterations on average, in response to very unusual user inputs, even though it has only 5 lines of code and no complex expressions. You've already profiled this code, and in the disassembly the profiler shows that a function foo, which then calls baz, has the largest number of samples, with many of them clustered around call instructions. The natural temptation might be to force inline foo. I would suggest instead trying to mark baz as noinline and timing the results. I've managed to make certain critical loops execute 3 times faster this way.
Analyzing the resulting assembly showed that the speedup came from the foo function now being inlined, as a result of the baz calls no longer being inlined into its body.
I've often found in cases like these that marking the analogical baz with noinline produces even bigger improvements than force inlining foo. I'm not enough of a computer architecture wizard to understand precisely why, but glancing at the disassembly and the distribution of samples in the profiler in such cases, the result of force inlining foo was that the compiler was still inlining the rarely-executed baz on top of foo, making foo more bloated than necessary by inlining rare-case function calls as well. By simply marking baz with noinline, we allow foo to be inlined when it wasn't before, without also inlining baz.

Why the extra code resulting from inlining baz slowed down the overall function is still not something I understand precisely; in my experience, jump instructions to more distant paths of code always seemed to take more time than closer jumps, but I'm at a loss as to why (maybe something to do with jump instructions taking more time with larger operands, or something to do with the instruction cache). What I can definitely say for sure is that favoring noinline in such cases offered superior performance to force inlining, and also didn't have such disruptive effects on the subsequent profiling sessions.
So anyway, I'd suggest giving noinline a try, and reaching for it first before force inlining.
Human vs. Optimizer
In what scenario is it assumed that I know better than my compiler on this issue?
I'd refrain from being so bold as to assume that. At least I'm not good enough to do so. If anything, I've learned over the years the humbling fact that my assumptions are often wrong once I check and measure what I try with the profiler. I have gotten past the stage (over a couple of decades of making my profiler my best friend) of taking completely blind stabs in the dark only to face humbling defeat and revert my changes, but at my best, I'm still making, at most, educated guesses.

Still, we programmers have always known better than our compilers how our product is supposed to be designed and how it is most likely going to be used by our customers. That at least gives us some edge in understanding the common-case and rare-case branches of code, knowledge that compilers don't possess (at least without PGO, and I've never gotten the best results with PGO). Compilers don't have this kind of runtime information or foresight about common-case user inputs. It is when I combine this user-end knowledge with a profiler in hand that I've found the biggest improvements, nudging the optimizer here and there by teaching it things like what to inline or, more commonly in my case, what to never inline.