Why is linker optimization so poor? (C++)

Recently, a coworker pointed out to me that compiling everything into a single file created much more efficient code than compiling separate object files - even with link time optimization turned on. In addition, the total compile time for the project went down significantly. Given that one of the primary reasons for using C++ is code efficiency, this was surprising to me.
Clearly, when the archiver/linker makes a library out of object files, or links them into an executable, even simple optimizations are penalized. In the example below, trivial inlining costs 1.8% in performance when done by the linker instead of the compiler. It seems like compiler technology should be advanced enough to handle fairly common situations like this, but it isn't happening.
Here is a simple example using Visual Studio 2008:
#include <cstdlib>
#include <iostream>
#include <boost/timer.hpp>
using namespace std;
int foo(int x);
int foo2(int x) { return x++; }
int main(int argc, char** argv)
{
    boost::timer t;

    t.restart();
    for (int i = 0; i < atoi(argv[1]); i++)
        foo(i);
    cout << "time : " << t.elapsed() << endl;

    t.restart();
    for (int i = 0; i < atoi(argv[1]); i++)
        foo2(i);
    cout << "time : " << t.elapsed() << endl;
}
foo.cpp
int foo (int x) { return x++; }
Results of a run: a 1.8% performance hit for using the linked foo instead of the inlined foo2.
$ ./release/testlink.exe 100000000
time : 13.375
time : 13.14
And yes, the linker optimization flags (/LTCG) are on.

Your coworker is out of date. The technology has been there since 2003 (in the Microsoft C++ compiler): /LTCG. Link-time code generation deals with exactly this problem. From what I know, GCC has this feature on the radar for the next-generation compiler.
LTCG does not only optimize the code, e.g. by inlining functions across modules, but actually rearranges code to optimize cache locality and branching for a specific load; see Profile-Guided Optimizations. These options are usually reserved for Release builds, as the build can take hours to finish: it will link an instrumented executable, run a profiling load, and then link again with the profiling results. The link contains details about what exactly is optimized with LTCG:
Inlining – For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.
Virtual Call Speculation – If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.
Register Allocation – Optimizing with profile data results in better register allocation.
Basic Block Optimization – Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.
Size/Speed Optimization – Functions where the program spends a lot of time can be optimized for speed.
Function Layout – Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.
Conditional Branch Optimization – With the value probes, profile-guided optimizations can find if a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions where the optimizer can order the if/else so that either the if or else block is placed first depending on which block is more frequently true.
Dead Code Separation – Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.
EH Code Separation – The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only on exceptional conditions.
Memory Intrinsics – The expansion of intrinsics can be decided better if it can be determined if an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.

I'm not a compiler specialist, but I think the compiler has much more information at its disposal to optimize, since it operates on a language tree, whereas the linker has to content itself with operating on the object output, which is far less expressive than the code the compiler has seen. Hence less effort is spent by the linker and compiler development team(s) on making linker optimizations that could match, in theory, the tricks the compiler does.
BTW, I'm sorry I distracted your original question into the LTCG discussion. I now understand your question was a little bit different, more concerned with the link-time vs. compile-time static optimizations possible/available.

Your coworker is smarter than most of us. Even if it seems a crude approach at first, project inlining into a single .cpp file has one thing that the other approaches, like link-time optimization, do not have and will not have for a while: reliability.
However, you asked this two years ago, and I can testify that a lot has changed since then (with g++ at least). Devirtualization is a lot more reliable, for instance.

Related

How to ensure some code is optimized away?

tl;dr: Can it be ensured somehow (e.g. by writing a unit test) that some things are optimized away, e.g. whole loops?
The usual approach to making sure that something is not included in the production build is wrapping it in #if...#endif. But I prefer to stay with C++ mechanisms instead. Even there, instead of complicated template specializations I like to keep implementations simple and argue "hey, the compiler will optimize this out anyway".
The context is embedded software in automotive (binary size matters) with often poor compilers. They are certified in the sense of safety, but usually not good at optimizations.
Example 1: In a container the destruction of elements is typically a loop:
for (size_t i = 0; i < elements; i++)
    buffer[i].~T();
This also works for built-in types such as int, as the standard allows the explicit call of the destructor for any scalar type (C++11 12.4-15). In such a case the loop does nothing and should be optimized out. In GCC it is, but in another compiler (for Aurix) it was not: I saw a literally empty loop in the disassembly! So that needed a template specialization to fix it.
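A minimal sketch of such a specialization fix, assuming a hand-rolled helper (the destroy_all name and tag-dispatch shape are mine, not from the question): the trivially-destructible case becomes an empty overload, so there is no loop for a weak optimizer to miss.
#include <type_traits>
#include <cstddef>

// Trivially destructible elements: nothing to do, no loop is ever emitted.
template <typename T>
void destroy_all(T*, std::size_t, std::true_type) {}

// Non-trivial destructors: run the usual destruction loop.
template <typename T>
void destroy_all(T* buffer, std::size_t elements, std::false_type)
{
    for (std::size_t i = 0; i < elements; i++)
        buffer[i].~T();
}

template <typename T>
void destroy_all(T* buffer, std::size_t elements)
{
    destroy_all(buffer, elements,
                typename std::is_trivially_destructible<T>::type{});
}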
Example 2: Code which is intended only for debugging, profiling, fault-injection, etc.:
constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if (isDebugging) {
        // Although a 'dead' section, it may not appear in the production binary!
        // (size, security, safety...)
        // 'if constexpr...' is not an option (C++11)
        std::cout << "Arg was " << arg << std::endl;
    }
    // normal code here...
}
I can look at the disassembly, sure. But being upstream platform software, it's hard to control all the targets, compilers, and options one might use. The fear is big that, for whatever reason, a downstream project ends up with code bloat or a performance issue.
Bottom line: Is it possible to write the software in such a way that certain code is known to be optimized away in a safe manner, as a #if would do? Or to write a unit test which fails if the optimization is not as expected?
[Timing tests come to mind for the first problem, but being bare-metal I don't have convenient tools yet.]
There may be a more elegant way, and it's not a unit test, but if you're just looking for that particular string, and you can make it unique,
strings $COMPILED_BINARY | grep "Arg was"
should show you whether the string is being included.
if constexpr is the canonical C++ expression (since C++17) for this kind of test.
#include <iostream>

constexpr bool DEBUG = /*...*/;

int main() {
    if constexpr (DEBUG) {
        std::cerr << "We are in debugging mode!" << std::endl;
    }
}
If DEBUG is false, then the code to print to the console won't be generated at all. So if you have things like log statements that you need for checking the behavior of your code, but which you don't want in production code, you can hide them inside if constexpr blocks to eliminate the code entirely once it moves to production.
Looking at your question, I see several (sub-)questions in it that require an answer. Not all answers might be possible with your bare-metal compilers, as hardware vendors don't care that much about C++.
The first question is: How do I write code in a way that I'm sure it gets optimized? The obvious answer here is to put everything in a single compilation unit so the caller can see the implementation.
The second question is: How can I force the compiler to optimize? Here constexpr is a blessing. Depending on whether you have support for C++11, C++14, C++17, or even the upcoming C++20, you'll get different feature sets of what you can do in a constexpr function. For the usage:
constexpr char c = std::string_view{"my_very_long_string"}[7];
With the code above, c is defined as a constexpr variable. Because you apply constexpr to the variable, you require some things:
Your compiler has to evaluate the code so that the value of c is known at compile time. This even holds true for -O0 builds!
All functions used to calculate c are constexpr and available. (And, as a result, this enforces the behaviour asked for in the first question.)
No undefined behaviour is allowed to be triggered in the calculation of c (for the given value).
The downside is that your input needs to be known at compile time.
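As a minimal sketch of that enforcement (square and its use are my own illustration, not from the question): forcing a result into a constexpr variable makes the compiler evaluate it during compilation or reject the program.
constexpr int square(int x) { return x * x; } // C++11-compatible constexpr

constexpr int s = square(12); // must be evaluated at compile time, even at -O0
static_assert(s == 144, "computed during compilation, not at runtime");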
C++17 also provides if constexpr, which has similar requirements: the condition needs to be calculable at compile time. The result is that the discarded branch is not instantiated at all (it can even contain code that wouldn't compile for the type you are using).
Which then brings us to the question: How do I ensure sufficient optimizations for my program to run fast enough, even if my compiler isn't well behaved? Here the only relevant answer is: create benchmarks and compare the results. Take the effort to set up a CI job that automates this for you. (And yes, you can even use external hardware, although that isn't that easy.) In the end, you have some requirements: handling A should take less than X seconds. Do A several times and time it. Even if the benchmarks don't cover everything, as long as the results are within the requirements, it's fine.
Note: As this is about debug code, you most likely can track the size of the executable as well. As soon as you start using streams, a lot of conversions to string, etc., your exe size will grow. (And you'll find this a blessing, as you will immediately find commits which add 10% to the image size.)
And then the final question: You have a buggy compiler; it doesn't meet your requirements. Here the only answer is: replace it. In the end, you can use any compiler to compile your code to bare metal, as long as the linker scripts work. If you need a start, C++Now 2018: Michael Caisse “Modern C++ in Embedded Systems” gives you a very good idea of what you need to use a different compiler. (Like a recent Clang or GCC, against which you can even log bugs if the optimization isn't good enough.)
Insert a reference to external data or a function into the block that should be verified to be optimised away. Like this:
extern void nop();

constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if (isDebugging) {
        nop();
        std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
    }
    // normal code here...
}
In debug builds, link with an implementation of nop() in an extra compilation unit nop.cpp:
void nop() {}
In release builds, don't provide an implementation.
Release builds will then only link if the optimisable code is eliminated.
(credit: kisch)
Here's another nice solution using inline assembly.
This uses assembler directives only, so it might even be kind of portable (checked with clang).
constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if (isDebugging) {
        asm(".globl _marker\n_marker:\n");
        std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
    }
    // normal code here...
}
This would leave an exported linker symbol in the compiled executable, if the code isn't optimised away. You can check for this symbol using nm(1).
clang can even stop the compilation right away:
constexpr bool isDebugging = false; // somehow a global flag

void foo(int arg) {
    if (isDebugging) {
        asm("_marker=1\n");
        std::cout << "Arg was " << arg << std::endl; // may not appear in production binary!
    }
    asm volatile (
        ".ifdef _marker\n"
        ".err \"code not optimised away\"\n"
        ".endif\n"
    );
    // normal code here...
}
This is not an answer to "How to ensure some code is optimized away?" but to your summary line "Can a unit test be written that e.g. whole loops are optimized away?".
First, the answer depends on how far you see the scope of unit-testing: if you put in performance tests, you might have a chance.
If, in contrast, you understand unit-testing as a way to test the functional behaviour of the code, then you can't. For one thing, optimizations (if the compiler works correctly) shall not change the behaviour of standard-conforming code.
With incorrect code (code that has undefined behaviour) optimizers can do what they want. (Well, for code with undefined behaviour the compiler can do so in the non-optimizing case as well, but sometimes only the deeper analyses performed during optimization make it possible for the compiler to detect that some code has undefined behaviour.) Thus, if you write unit-tests for some piece of code with undefined behaviour, the test results may differ when you run the tests with and without optimization. But, strictly speaking, this only tells you that the compiler translated the code in two different ways - it does not guarantee you that the code is optimized in the way you want it to be.
Here's another way, one that also covers the first example.
You can verify (at runtime) that the code has been eliminated, by comparing two labels placed around it.
This relies on the GCC extension "Labels as Values" https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
before:
    for (size_t i = 0; i < elements; i++)
        buffer[i].~T();
behind:
    if (intptr_t(&&behind) != intptr_t(&&before)) abort();
It would be nice if you could check this in a static_assert(), but sadly the difference of two &&label expressions is not accepted as a compile-time constant.
GCC insists on inserting a runtime comparison, even though both labels are in fact at the same address.
Interestingly, if you compare the addresses (type void*) directly, without casting them to intptr_t, GCC falsely optimises away the if() as "always true", whereas clang correctly optimises away the complete if() as "always false", even at -O1.

C++: How to determine whether a function can be inlined, and whether it actually was?

I have a question about inline functions in C++. I know that similar questions have appeared many times on this site. I hope that mine is a little bit different.
I know that when you specify some function to be inline, it is just a "suggestion" to the compiler. So in this case:
inline int func1()
{
    return 2;
}
Some code later
cout << func1() << endl; // replaced by cout << 2 << endl;
So there is no mystery there, but what about cases like this:
inline int func1()
{
    return 2;
}

inline int func2()
{
    return func1() * 2;
}

inline int func3()
{
    return func2() * func1() * 2;
}
And so on...
Which of these functions have a chance of becoming inlined, is it beneficial, and how do I check what the compiler actually did?
Which of these functions have a chance of becoming inlined
Any and all functions have a chance of becoming inlined, if the tool(1) doing the inlining has access to the function's definition (= body) ...
is it beneficial
... and deems it beneficial to do so. Nowadays, it's the job of the optimiser to determine where inlining makes sense, and for 99.9% of programs, the best the programmer can do is stay out of the optimiser's way. The remaining few cases are programs like Facebook, where a 0.3% performance loss is a huge regression. In such cases, manual tweaking of optimisations (along with profiling, profiling, and profiling) is the way to go.
how do I check what the compiler actually did
By inspecting the generated assembly. Every compiler has a flag to make it output assembly in "human-readable" format instead of (or in addition to) object files in binary form.
(1) Normally, this tool is the compiler, and inlining happens as part of the compilation step (turning source code into assembly/object files). That is also the only reason why you may be required to use the inline keyword to actually allow a compiler to inline: the function's definition must be visible in the translation unit (= source file) being compiled, and quite often that means putting the function definition into a header file. Without inline, this would lead to multiple-definition errors if the header file were included in more than one translation unit.
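A minimal sketch of that rule (the header and function names are mine, purely illustrative):
// twice.h - a header included from several .cpp files
#ifndef TWICE_H
#define TWICE_H

// 'inline' permits one identical definition per translation unit; without
// it, two .cpp files including this header would fail to link with a
// multiple-definition error.
inline int twice(int x) { return x + x; }

#endif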
Note that compilation is not the only stage when inlining is possible. When you enable Whole-Program Optimisation (also known as Link-Time Code Generation), one more pass of optimisation happens at link time, once all object files are created. At this point, the inline keyword is totally irrelevant, since the linker has access to all the function definitions (the binary wouldn't link successfully otherwise). This is therefore the way to get the most benefit from inlining without having to think about it at all while writing code. The downside is time: WPO takes time to run and, for large projects, can prolong link times to unacceptable levels (I've personally experienced a somewhat pathological case where enabling WPO took a program's link time from 7 minutes to 46).
Think of inline as only a hint to the compiler, a bit like register was in old versions of the C++ and C standards. Caveat: register was deprecated and has been removed in C++17.
Which of these functions have a chance of becoming inlined, is it beneficial
Trust your compiler to make sane inlining decisions. To inline some particular call, the compiler needs to know the body of the called function. You should not care whether the compiler inlines it or not (in theory).
In practice, with the GCC compiler:
inlining does not always improve performance (e.g. because of CPU cache issues, TLB, the branch predictor, etc.)
inlining decisions depend a lot on optimization options; inlining probably is more likely to happen at -O3 than at -O1, and there are many guru options (like -finline-limit= and others) to tune it.
notice that individual calls get inlined or not: it is quite possible that some call occurrence like foo(x) at line 123 is inlined, while another call occurrence of the same function foo, like foo(y) at some other place like line 456, is not.
when debugging, you may want to disable inlining (because that makes debugging more convenient). This is possible with the -fno-inline GCC optimization flag (which I often use together with -g, which asks for debugging information).
the always_inline function attribute "forces" inlining, and the noinline attribute prevents it (see the sketch at the end of this answer).
if you compile and link with link-time optimization (LTO) such as -flto -O2 (or -flto -O3), e.g. with CXX='g++ -flto -O2' in your Makefile, inlining can happen between several translation units (e.g. C++ source files). However, LTO at least doubles the compilation time (and often worse) and consumes memory during compilation (so you had better have a lot of RAM then), and it often improves performance by only a few percent (with weird exceptions to this rule of thumb).
you might optimize a function differently from the rest of the code, e.g. with #pragma GCC optimize ("-O3") or with the optimize function attribute
Look also into profile-guided optimization, with instrumentation options like -fprofile-generate and later optimization with -fprofile-use combined with other optimization flags.
If you are curious about which calls are inlined (and sometimes some won't be), look into the generated assembler (e.g. use g++ -O2 -S -fverbose-asm and look into the .s assembler file), or use some internal dump options.
The observable behavior of your code (except performance) should not depend upon inlining decisions made by your compiler. In other words, don't expect inlining to happen (or not to happen). If your code behaves differently with and without some optimization, it is likely to be buggy. So read about undefined behavior.
See also MILEPOST GCC project (using machine learning techniques for optimization purposes).
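As a minimal sketch of those attributes (the function names are mine, purely illustrative; GCC and Clang syntax):
// always_inline forces inlining even at -O0 (GCC errors out if it cannot);
// noinline keeps the call out of line at every optimization level.
__attribute__((always_inline)) inline int fast_path(int x) { return x + 1; }
__attribute__((noinline)) int rare_path(int x) { return x * 2; }

int dispatch(int x) {
    return (x >= 0) ? fast_path(x) : rare_path(x);
}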

Is link time optimization in gcc 5.1 good enough to give up on inlining simple functions?

Out of habit I often write function definitions inline for simple functions such as this (contrived example):
class PositiveInteger
{
private:
    long long unsigned m_i;

public:
    PositiveInteger (int i);
};

inline PositiveInteger :: PositiveInteger (int i)
    : m_i (i)
{
    if (i < 0)
        throw "oops";
}
I generally like to separate interface files and implementation files but, nevertheless, this is my habit for those functions which the voice in my head tells me will probably be hit a lot in hot spots.
I know the advice is "profile first" and I agree but I could avoid a whole load of profiling effort if I knew a priori that the compiler would produce identical final object code whether functions like this were inlined at compilation or link time. (Also, I believe the injected profiling code itself can cause a change in timing which swamps the effect of very simple functions such as the one above.)
GCC 5.1 has just been released advertising LTO (link time optimization) improvements. How good are they really? What kinds of functions can I safely un-inline knowing the final executable will not be affected?
You already answered your own question: unless you're targeting an embedded system of some sort with restricted resources, write the code for clarity and maintainability first. Then, if performance isn't acceptable, you can profile and target your efforts towards the actual hotspots. Think about it: if you write clearer code that takes an extra 250 ns, and that's not noticeable in your use case, then the extra time doesn't matter.
GCC with LTO does cross-module inlining, so most of the time you should not see a difference in code quality. Out-of-line function definitions are also not duplicated across translation units, which compiles faster and produces smaller object files.
GCC's inlining heuristics do, however, treat the inline keyword as a hint that the function is probably a good candidate for inlining, and increase the limits on function size accordingly. Similarly, functions declared in the same translation unit as the caller get a bit of extra preference. For small functions like the one in your example, this should not make any difference.

Is there a reason why not to use link-time optimization (LTO)?

GCC, MSVC, LLVM, and probably other toolchains have support for link-time (whole program) optimization to allow optimization of calls among compilation units.
Is there a reason not to enable this option when compiling production software?
I assume that by "production software" you mean software that you ship to customers / that goes into production. The answers at Why not always use compiler optimization? (kindly pointed out by Mankarse) mostly apply to situations in which you want to debug your code (so the software is still in the development phase -- not in production).
6 years have passed since I wrote this answer, and an update is necessary. Back in 2014, the issues were:
Link time optimization occasionally introduced subtle bugs, see for example Link-time optimization for the kernel. I assume this is less of an issue as of 2020. Safeguard against these kinds of compiler and linker bugs: Have appropriate tests to check the correctness of your software that you are about to ship.
Increased compile time. There are claims that the situation has significantly improved since 2014, for example thanks to slim objects.
Large memory usage. This post claims that the situation has drastically improved in recent years, thanks to partitioning.
As of 2020, I would try to use LTO by default on any of my projects.
This recent question raises another possible (but rather specific) case in which LTO may have undesirable effects: if the code in question is instrumented for timing, and separate compilation units have been used to try to preserve the relative ordering of the instrumented and instrumenting statements, then LTO has a good chance of destroying the necessary ordering.
I did say it was specific.
If you have well-written code, it should only be advantageous. You may hit a compiler/linker bug, but this goes for all types of optimisation; such bugs are rare.
The biggest downside is that it drastically increases link time.
Apart from this, consider a typical example from an embedded system:
void function1(void) { /* Do something */ } // located at address 0x1000
void function2(void) { /* Do something */ } // located at address 0x1100
void function3(void) { /* Do something */ } // located at address 0x1200

With functions at predefined addresses, calls can be made through those fixed addresses, as below:

((void (*)(void))0x1000)(); // expected to call function1
((void (*)(void))0x1100)(); // expected to call function2
((void (*)(void))0x1200)(); // expected to call function3
LTO can lead to unexpected behavior.
Updated:
In automotive embedded SW development, multiple parts of the SW are compiled and flashed onto separate sections.
The boot-loader, application(s), and application configuration(s) are independently flashable units. The boot-loader has special capabilities to update the application and the application configuration. At every power-on cycle the boot-loader ensures the compatibility and consistency of the SW application and application configuration via hard-coded locations for SW versions, CRCs, and many more parameters. Linker definition files are used to hard-code the variable locations and some function locations.
Given that the code is implemented correctly, link-time optimization should not have any impact on the functionality. However, there are scenarios where code that is not 100% correct will typically just work without link-time optimization, but will stop working with it. There are similar situations when switching to higher optimization levels, like from -O2 to -O3 with gcc.
That is, depending on your specific context (like the age of the code base, its size, the depth of tests, whether you are starting your project or are close to final release, ...), you would have to judge the risk of such a change.
One scenario where link-time optimization can lead to unexpected behavior for wrong code is the following:
Imagine you have two source files, read.c and client.c, which you compile into separate object files. In read.c there is a function read that does nothing else than read from a specific memory address. The content at this address should be marked volatile, but unfortunately that was forgotten. From client.c the function read is called several times from the same function. Since read only performs a single read from the address, and there is no optimization beyond the boundaries of the read function, read will, whenever called, access the respective memory location. Consequently, every time read is called from client.c, the code in client.c gets a freshly read value from the address, just as if volatile had been used.
Now, with link-time optimization, the tiny function read from read.c is likely to be inlined wherever it is called from client.c. Due to the missing volatile, the compiler will then realize that the code reads several times from the same address, and may therefore optimize away the memory accesses. Consequently, the code starts to behave differently.
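A minimal sketch of that scenario (the register address and the polling loop are my own illustration, not from the answer):
// read.c -- the forgotten 'volatile' is the bug described above
#define STATUS_ADDR 0x40000000u /* hypothetical hardware register */

unsigned read(void) {
    return *(unsigned *)STATUS_ADDR; /* should be *(volatile unsigned *) */
}

// client.c
unsigned read(void);

void wait_until_ready(void) {
    /* Compiled separately, each iteration performs a real load. With LTO,
       read() gets inlined, the repeated loads can be merged into one, and
       this may become an endless loop that reads the value only once. */
    while (read() == 0) { }
}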
Rather than mandating that all implementations support the semantics necessary to accomplish all tasks, the Standard allows implementations intended to be suitable for various tasks to extend the language by defining semantics in corner cases beyond those mandated by the C Standard, in ways that would be useful for those tasks.
An extremely popular extension of this form is to specify that cross-module function calls will be processed in a fashion consistent with the platform's Application Binary Interface without regard for whether the C Standard would require such treatment.
Thus, if one makes a cross-module call to a function like:
uint32_t read_uint32_bits(void *p)
{
    return *(uint32_t*)p;
}
the generated code would read the bit pattern in a 32-bit chunk of storage at address p, and interpret it as a uint32_t value using the platform's native 32-bit integer format, without regard for how that chunk of storage came to hold that bit pattern. Likewise, if a compiler were given something like:
uint32_t read_uint32_bits(void *p);

uint32_t f1bits, f2bits;

void test(void)
{
    float f;
    f = 1.0f;
    f1bits = read_uint32_bits(&f);
    f = 2.0f;
    f2bits = read_uint32_bits(&f);
}
the compiler would reserve storage for f on the stack, store the bit pattern for 1.0f to that storage, call read_uint32_bits and store the returned value, store the bit pattern for 2.0f to that storage, call read_uint32_bits and store that returned value.
The Standard provides no syntax to indicate that the called function might read the storage whose address it receives using type uint32_t, nor to indicate that the pointer the function was given might have been written using type float, because implementations intended for low-level programming already extended the language to support such semantics without using special syntax.
Unfortunately, adding in Link Time Optimization will break any code that relies upon that popular extension. Some people may view such code as broken, but if one recognizes the Spirit of C principle "Don't prevent programmers from doing what needs to be done", the Standard's failure to mandate support for a popular extension cannot be viewed as intending to deprecate its usage if the Standard fails to provide any reasonable alternative.
LTO could also reveal edge-case bugs in code-signing algorithms. Consider a code-signing algorithm based on certain expectations about the TEXT portion of some object or module. Now LTO optimizes the TEXT portion away, or inlines stuff into it in a way the code-signing algorithm was not designed to handle. Worst case scenario, it only affects one particular distribution pipeline but not another, due to a subtle difference in which encryption algorithm was used on each pipeline. Good luck figuring out why the app won't launch when distributed from pipeline A but not B.
LTO support is buggy, and LTO-related issues have the lowest priority for compiler developers. For example: mingw-w64-x86_64-gcc-10.2.0-5 works fine with LTO, while mingw-w64-x86_64-gcc-10.2.0-6 segfaults with a bogus address. We have just noticed that the Windows CI stopped working.
Please refer to the following issue as an example.

When should I use __forceinline instead of inline?

Visual Studio includes support for __forceinline. The Microsoft Visual Studio 2005 documentation states:
The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer instead.
This raises the question: When is the compiler's cost/benefit analysis wrong? And, how am I supposed to know that it's wrong?
In what scenario is it assumed that I know better than my compiler on this issue?
You know better than the compiler only when your profiling data tells you so.
The one place I am using it is licence verification.
One important factor in protecting against easy* cracking is to verify being licenced in multiple places rather than only one, and you don't want these places to be the same function call.
*) Please don't turn this in a discussion that everything can be cracked - I know. Also, this alone does not help much.
The compiler is making its decisions based on static code analysis, whereas if you profile, as don says, you are carrying out a dynamic analysis that can be much farther-reaching. The number of calls to a specific piece of code is often largely determined by the context in which it is used, e.g. the data. Profiling a typical set of use cases will establish this. Personally, I gather this information by enabling profiling on my automated regression tests. In addition to forcing inlines, I have unrolled loops and carried out other manual optimizations on the basis of such data, to good effect. It is also imperative to profile again afterwards, as sometimes your best efforts can actually lead to decreased performance. Again, automation makes this a lot less painful.
More often than not, though, in my experience, tweaking algorithms gives much better results than straight code optimization.
I've developed software for limited-resource devices for 9 years or so, and the only time I've ever seen the need to use __forceinline was in a tight loop where a camera driver needed to copy pixel data from a capture buffer to the device screen. There we could clearly see that the cost of a specific function call really hogged the overlay drawing performance.
The only way to be sure is to measure performance with and without. Unless you are writing highly performance critical code, this will usually be unnecessary.
For SIMD code.
SIMD code often uses constants/magic numbers. In a regular function, every const __m128 c = _mm_setr_ps(1,2,3,4); becomes a memory reference.
With __forceinline, the compiler can load it once and reuse the value, unless your code exhausts the registers (usually 16).
CPU caches are great but registers are still faster.
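A minimal sketch of the pattern (the function and the constant are mine, illustrative only):
#include <xmmintrin.h>

// Once force-inlined into a hot loop, 'c' can be kept in a register across
// iterations instead of being re-loaded from memory on every call.
__forceinline __m128 scale(__m128 v) {
    const __m128 c = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f); // magic constants
    return _mm_mul_ps(v, c);
}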
P.S. Just got a 12% performance improvement from __forceinline alone.
The inline directive will be of no use when used for functions which are:
recursive,
long,
composed of loops.
If you want to force such a function to be inlined, use __forceinline.
Actually, even with the __forceinline keyword, Visual C++ sometimes chooses not to inline the code. (Source: the resulting assembly source code.)
Always look at the resulting assembly code where speed is of importance (such as tight inner loops that need to run on each frame).
Sometimes using #define instead of inline will do the trick. (Of course you lose a lot of checking by using #define, so use it only when and where it really matters.)
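A minimal sketch of that last-resort trick (the macro is mine, illustrative only), including the kind of checking you lose:
// The preprocessor pastes the body at every use site: guaranteed "inlining",
// but no type checking, and each argument is re-evaluated per occurrence.
#define SQUARE(x) ((x) * (x))

int main() {
    int i = 3;
    int ok  = SQUARE(4);   // ((4) * (4)) == 16, no call overhead possible
    int bad = SQUARE(i++); // ((i++) * (i++)) -- undefined behaviour!
    return ok + bad;
}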
Actually, boost is loaded with it.
For example
BOOST_CONTAINER_FORCEINLINE flat_tree& operator=(BOOST_RV_REF(flat_tree) x)
    BOOST_NOEXCEPT_IF( (allocator_traits_type::propagate_on_container_move_assignment::value ||
                        allocator_traits_type::is_always_equal::value) &&
                       boost::container::container_detail::is_nothrow_move_assignable<Compare>::value)
{ m_data = boost::move(x.m_data); return *this; }

BOOST_CONTAINER_FORCEINLINE const value_compare &priv_value_comp() const
{ return static_cast<const value_compare &>(this->m_data); }

BOOST_CONTAINER_FORCEINLINE value_compare &priv_value_comp()
{ return static_cast<value_compare &>(this->m_data); }

BOOST_CONTAINER_FORCEINLINE const key_compare &priv_key_comp() const
{ return this->priv_value_comp().get_comp(); }

BOOST_CONTAINER_FORCEINLINE key_compare &priv_key_comp()
{ return this->priv_value_comp().get_comp(); }

public:
// accessors:
BOOST_CONTAINER_FORCEINLINE Compare key_comp() const
{ return this->m_data.get_comp(); }

BOOST_CONTAINER_FORCEINLINE value_compare value_comp() const
{ return this->m_data; }

BOOST_CONTAINER_FORCEINLINE allocator_type get_allocator() const
{ return this->m_data.m_vect.get_allocator(); }

BOOST_CONTAINER_FORCEINLINE const stored_allocator_type &get_stored_allocator() const
{ return this->m_data.m_vect.get_stored_allocator(); }

BOOST_CONTAINER_FORCEINLINE stored_allocator_type &get_stored_allocator()
{ return this->m_data.m_vect.get_stored_allocator(); }

BOOST_CONTAINER_FORCEINLINE iterator begin()
{ return this->m_data.m_vect.begin(); }

BOOST_CONTAINER_FORCEINLINE const_iterator begin() const
{ return this->cbegin(); }

BOOST_CONTAINER_FORCEINLINE const_iterator cbegin() const
{ return this->m_data.m_vect.begin(); }
There are several situations where the compiler is not able to determine categorically whether it is appropriate or beneficial to inline a function. Inlining may involve trade-offs that the compiler is unwilling to make, but you are (e.g., code bloat).
In general, modern compilers are actually pretty good at making this decision.
When you know that the function is going to be called in one place several times for a complicated calculation, then it is a good idea to use __forceinline. For instance, a matrix multiplication for animation may need to be called so many times that the calls to the function will start to be noticed by your profiler. As others have said, the compiler can't really know about that, especially in a dynamic situation where the execution of the code is unknown at compile time.
A Case For noinline
I wanted to pitch in with an unusual suggestion and actually vouch for __declspec(noinline) in MSVC, or the noinline attribute/pragma in GCC and ICC, as an alternative to try out first over __forceinline and its equivalents when staring at profiler hotspots. YMMV, but I've gotten so much more mileage (measured improvements) out of telling the compiler what to never inline than what to always inline. It also tends to be far less invasive and can produce much more predictable and understandable hotspots when profiling the changes.
While it might seem very counter-intuitive and somewhat backward to try to improve performance by telling the compiler what not to inline, I'd claim, based on my experience, that it's much more harmonious with how optimizing compilers work and far less invasive to their code generation. A detail that's easy to forget is this:
Inlining a callee can often cause the caller, or the caller of the caller, to cease to be inlined.
This is what makes force inlining a rather invasive change to the code generation that can have chaotic results on your profiling sessions. I've even had cases where force inlining a function reused in several places completely reshuffled all top ten hotspots with the highest self-samples all over the place in very confusing ways. Sometimes it got to the point where I felt like I'm fighting with the optimizer making one thing faster here only to exchange a slowdown elsewhere in an equally common use case, especially in tricky cases for optimizers like bytecode interpretation. I've found noinline approaches so much easier to use successfully to eradicate a hotspot without exchanging one for another elsewhere.
It would be possible to inline functions much less invasively if we could inline at the call site instead of deciding whether or not every single call to a function should be inlined. Unfortunately, I've not found many compilers supporting such a feature besides ICC. It makes much more sense to me, when reacting to a hotspot, to respond by inlining at the call site instead of making every single call of a particular function forcefully inlined. Lacking this wide support among most compilers, I've gotten far more successful results with noinline.
Optimizing With noinline
So the idea of optimizing with noinline is still with the same goal in mind: to help the optimizer inline our most critical functions. The difference is that instead of trying to tell the compiler what they are by forcefully inlining them, we are doing the opposite and telling the compiler what functions definitely aren't part of the critical execution path by forcefully preventing them from being inlined. We are focusing on identifying the rare-case non-critical paths while leaving the compiler still free to determine what to inline in the critical paths.
Say you have a loop that executes for a million iterations, and there is a function called baz which is only very rarely called in that loop, once every few thousand iterations on average, in response to very unusual user inputs, even though it only has 5 lines of code and no complex expressions. You've already profiled this code, and the profiler shows in the disassembly that calling a function foo, which then calls baz, has the largest number of samples, with lots of samples distributed around calling instructions. The natural temptation might be to force-inline foo. I would suggest instead trying to mark baz as noinline and timing the results. I've managed to make certain critical loops execute 3 times faster this way.
Analyzing the resulting assembly, the speedup came from the foo function now being inlined as a result of no longer inlining baz calls into its body.
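A minimal sketch of that shape, reusing the foo/baz names from the text (the portability macro and the bodies are mine, illustrative only):
#if defined(_MSC_VER)
  #define NOINLINE __declspec(noinline)
#else
  #define NOINLINE __attribute__((noinline))
#endif

// Rare path: forcibly kept out of line so it stops bloating its caller.
NOINLINE void baz(int input) {
    /* handle the unusual user input */
}

// Hot path: small again once baz's body is no longer pasted into it, so the
// optimizer is free to inline foo into the million-iteration loop itself.
void foo(int input) {
    if (input == 9999) // rarely true
        baz(input);
    /* common-case work */
}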
I've often found in cases like these that marking the analogical baz with noinline produces even bigger improvements than force inlining foo. I'm not a computer architecture wizard to understand precisely why but glancing at the disassembly and the distribution of samples in the profiler in such cases, the result of force inlining foo was that the compiler was still inlining the rarely-executed baz on top of foo, making foo more bloated than necessary by still inlining rare-case function calls. By simply marking baz with noinline, we allow foo to be inlined when it wasn't before without actually also inlining baz. Why the extra code resulting from inlining baz as well slowed down the overall function is still not something I understand precisely; in my experience, jump instructions to more distant paths of code always seemed to take more time than closer jumps, but I'm at a loss as to why (maybe something to do with the jump instructions taking more time with larger operands or something to do with the instruction cache). What I can definitely say for sure is that favoring noinline in such cases offered superior performance to force inlining and also didn't have such disruptive results on the subsequent profiling sessions.
So anyway, I'd suggest giving noinline a try first, before reaching for force inlining.
Human vs. Optimizer
In what scenario is it assumed that I know better than my compiler on this issue?
I'd refrain from being so bold as to assume. At least I'm not good enough to do that. If anything, I've learned over the years the humbling fact that my assumptions are often wrong once I check and measure the things I try with the profiler. I have gotten past the stage (over a couple of decades of making my profiler my best friend) of taking completely blind stabs in the dark only to face humbling defeat and revert my changes, but at my best I'm still making, at most, educated guesses. Still, I've always known better than my compiler, and hopefully most of us programmers have always known better than our compilers, how our product is supposed to be designed and how it is most likely going to be used by our customers. That at least gives us some edge in understanding common-case and rare-case branches of code that compilers don't possess (at least without PGO, and I've never gotten the best results with PGO). Compilers don't possess this type of runtime information and foresight of common-case user inputs. It is when I combine this user-end knowledge with a profiler in hand that I've found the biggest improvements, nudging the optimizer here and there in teaching it things like what to inline or, more commonly in my case, what to never inline.