What are the caveats to using inline optimization in C++ functions? - c++

What are the benefits of inlining different types of functions, and what issues would I need to watch out for when developing around them? I am not very handy with a profiler, but in many different algorithmic applications inlining seems to increase the speed as much as 8 times over. If you can give any pointers, that'd be of great use to me.

Inline functions are oft' overused, and the consequences are significant. Inline indicates to the compiler that a function may be considered for inline expansion. If the compiler chooses to inline a function, the function is not called, but copied into place. The performance gain comes in avoiding the function call, stack frame manipulation, and the function return. The gains can be considerable.
Beware that they can increase program size. They can also increase execution time by reducing the caller's locality of reference: when code size grows, the caller's inner loop may no longer fit in the processor's instruction cache, causing unnecessary cache misses and the consequent performance hit. Inline functions also increase build times: if an inline function changes, every file that includes it must be recompiled. Some guidelines:
Avoid inlining functions until profiling indicates which functions could benefit from inline.
Consider using your compiler's option for auto-inlining after profiling both with and without auto-inlining.
Only inline functions where the function call overhead is large relative to the function's code. In other words, inlining large functions or functions that call other (possibly inlined) functions is not a good idea.
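As a rough illustration of the last guideline, here is a minimal sketch. The names square, sum_of_squares and check_sum_of_squares are made up for this example. The one-line helper is the kind of function where call overhead would dominate, while the loop around it is left as an ordinary function:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the body is a single multiply, so the cost of a real
// call would be large relative to the work done.
inline int square(int x) { return x * x; }

// Larger routine with a loop. It is left as an ordinary function so the
// compiler can decide per call site whether expanding it is worthwhile.
long long sum_of_squares(const std::vector<int>& values) {
    long long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += square(values[i]);  // likely expanded in place at -O2, not guaranteed
    }
    return total;
}

// Small usage check: 1 + 4 + 9 == 14.
void check_sum_of_squares() {
    assert(sum_of_squares(std::vector<int>{1, 2, 3}) == 14);
}
```

Whether square is actually expanded is still up to the compiler; the inline keyword here mainly allows the definition to live in a header.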

The most important pointer is that you should in almost all cases let the compiler do its thing and not worry about it.
The compiler is free to perform inline expansion of a function even if you do not declare it inline, and it is free not to perform inline expansion even if you do declare it inline. It's entirely up to the compiler, which is okay, because in most cases it knows far better than you do when a function should be expanded inline.

One of the reasons the compiler does a better job of inlining than the programmer is that the cost/benefit tradeoff is actually decided at the lowest level of machine abstraction: how many assembly instructions make up the function that you want to inline. Consider the ratio between the execution time of a typical non-branching assembly instruction and that of a function call. This ratio is known to the machine code generator, which is why the compiler can use that information to guide inlining.
The high level compiler will often try to take care of another opportunity for inlining: when a function B is only called from function A and never called from elsewhere. This inlining is not done for performance reasons (assuming A and B are not small functions), but is useful in reducing linking time by reducing the total number of "functions" that need to be generated.
Added examples
An example of where the compiler performs massive inlining (with massive speedup) is in the compilation of the STL containers. The STL container classes are written to be highly generic, and as a result each "function" performs only a tiny bit of work. When inlining is disabled, for example when compiling in debug mode, the speed of STL containers drops considerably.
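To make the STL point concrete, here is a small sketch (the function names are invented for illustration). Every loop iteration goes through tiny member functions such as size(), operator[], begin() and end(), which is why these loops depend so heavily on inlining:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Each iteration calls size() and operator[], both very small member functions.
long long sum_indexed(const std::vector<int>& v) {
    long long total = 0;
    for (std::vector<int>::size_type i = 0; i < v.size(); ++i) {
        total += v[i];
    }
    return total;
}

// Same result through iterators and std::accumulate, again built from
// many tiny calls (iterator increment, dereference, comparison).
long long sum_accumulate(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// Usage check: the sum of 1..1000 is 500500 either way.
void check_sums() {
    std::vector<int> v(1000);
    std::iota(v.begin(), v.end(), 1);
    assert(sum_indexed(v) == 500500);
    assert(sum_accumulate(v) == 500500);
}
```

Comparing the disassembly of these two functions in an optimized and an unoptimized build is a quick way to see how much of the container interface disappears once inlining is on.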
A second example would be when the callee function contains certain instructions that require the stack to be undisturbed between the caller and callee. This happens with SIMD instructions using intrinsics. Fortunately, the compilers are smart enough to automatically inline these callee functions because they can inspect whether SIMD assembly instructions are emitted and inline them to make sure the stack is undisturbed.
The bottom line
Unless you are familiar with low-level profiling and are good at assembly programming/optimization, it is better to let the compiler do the job. The STL is a special case in which it might make sense to enable inlining (with a switch) even in debug mode.

The main benefits of inlining a function are that you remove the calling overhead and allow the compiler to optimize across the call boundaries. Generally, the more freedom you give the optimizer, the better your program will perform.
The downside is that the function no longer exists. A debugger won't be able to tell you're inside of it, and no outside code can call it. You also can't replace its definition at run time, as the function body exists in many different locations.
Furthermore, the size of your binary is increased.
Generally, you should declare a function static if it has no external callers, rather than marking it inline. Only let a function be inlined if you're sure there are no negative side effects.
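A minimal sketch of the static suggestion, with made-up function names. The helper has internal linkage, so the compiler knows every caller is in this file and can inline it or keep it as it sees fit:

```cpp
#include <cassert>

// Internal linkage: no other translation unit can call this function, so the
// compiler may inline every call and drop the standalone copy entirely.
static int scale_by_three(int x) { return 3 * x; }

// External linkage: this function stays callable from other files, while the
// calls to the static helper inside it are candidates for inlining.
int scaled_sum(int a, int b) { return scale_by_three(a) + scale_by_three(b); }

// Usage check: 3*1 + 3*2 == 9.
void check_scaled_sum() { assert(scaled_sum(1, 2) == 9); }
```

In C++ an unnamed namespace gives the helper the same internal linkage and is often preferred over static for this purpose.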

Function call overhead is pretty small. A more significant advantage of inline functions is the ability to use "by reference" variables directly without needing an extra level of pointer indirection. A function which makes heavy use of parameters passed by reference may benefit greatly if its parameters devolve to simple variables or fields.
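Here is a small sketch of that by-reference effect (track_range and range_width are hypothetical names). If the call is made out of line, lo and hi have to be passed as addresses; once the helper is inlined, they can simply live in registers of the caller:

```cpp
#include <cassert>

// Out of line, lo and hi arrive as addresses and every access is a load or
// store through a pointer. Inlined, they are just the caller's locals.
inline void track_range(int value, int& lo, int& hi) {
    if (value < lo) lo = value;
    if (value > hi) hi = value;
}

// Returns max - min over the first n elements of data.
int range_width(const int* data, int n) {
    assert(n > 0);  // needs at least one element
    int lo = data[0];
    int hi = data[0];
    for (int i = 1; i < n; ++i) {
        track_range(data[i], lo, hi);
    }
    return hi - lo;
}
```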

Related

Is there a performance difference by wrapping the code in an immediately invoked lambda expression?

Here is some code:
void f()
{
// stuff
{
// code
}
}
It is also possible to write it in a strange way like this using lambdas instead of braces:
void f()
{
// same stuff as above
[&]{
// same code as above
}();
}
Will there be any performance difference between the two versions?
According to my checks, there is no difference in generated assembly in clang when compiled with optimizations, so I assume there will be no performance overhead. But is this always the case?
Performance is hard to measure. Yet in this case it's still reasonable to reason about it.
Every lambda is actually its own class with operator() implemented. This class has the same characteristics as one written in an unnamed namespace. Some relevant elements: its visibility is limited to the .cpp file and linked to it, it doesn't need to expose function pointers in the .obj file.
What could a compiler do differently with an immediately invoked lambda? Not much, actually: it can decline to inline it. In my experience this happens for the same reasons as with functions in an unnamed namespace: either the function is too large or it's used multiple times. The latter could be the result of a function returning a lambda.
If the function is too large, then it could be that some paths where the function isn't called are faster by not inlining it.
If it's called multiple times, it enlarges your binary to inline it twice, which could slow it down.
To me, the bigger risk is that you call some templated function like std::sort with a lambda and copy that function body all over the place, bloating your binary. However, as these were already templates before, and std::function is known for its measurable performance effects, I don't think it's worth the effort to avoid.
That said, I use lambdas all over the place. I even provide class templates that have them as members in performance critical code. Lambdas are considered zero overhead, though depending on how you use them you could find edge cases where a flow in your program slows down.
One last piece of advice: even in a language like C++, readability is important. Large lambdas aren't considered readable. I've seen style guide rules limiting them to 5 or 10 lines.
Immediately invoked lambdas have their uses, however, for your example this actually is only overhead from the reader's perspective.
Go and measure! If you have performance critical code, write a performance test for continuous monitoring and run a profiler from time to time to see where time is spent.
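For completeness, here is a sketch of one place where an immediately invoked lambda does earn its keep, unlike the bare block in the question: initializing a const variable from branching logic. The function name classify is made up for this example.

```cpp
#include <cassert>
#include <string>

// The lambda runs once, immediately, and its result initializes a const
// local. Without it, label would have to be non-const and assigned later.
std::string classify(int n) {
    const std::string label = [&] {
        if (n < 0) return std::string("negative");
        if (n == 0) return std::string("zero");
        return std::string("positive");
    }();
    return label;
}

// Usage check.
void check_classify() {
    assert(classify(-5) == "negative");
    assert(classify(0) == "zero");
    assert(classify(7) == "positive");
}
```

As with the braces version, an optimizing compiler will normally inline the lambda body here, but that is an expectation to verify by measuring, not a guarantee.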

Inlining a function that manipulates data on heap

I'm working on optimizing a code where most of the objects are allocated on heap.
What I'm trying to understand is: if/why the compiler might not inline a function call that potentially manipulates data on heap.
To make things more clear, suppose you have the following code:
class A
{
public:
void foo() // non-const function
{
// modify data
i++;
...
}
private:
int i;
// can be anything here, including pointers
};
int main()
{
A a; // allocate something on stack
auto ptr = std::make_unique<A>(); // allocate something on heap
a.foo(); // case 1
ptr->foo(); // case 2
return 0;
}
Is it possible that a.foo() gets inlined while ptr->foo() does not?
My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
Assume that there are no virtual functions
EDIT: I guess my question is partially theoretical. Suppose you are implementing a compiler, can you think of any legitimate reason why you won't optimize ptr->foo() while optimizing a.foo()?
My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
That is not relevant. Inline function and "regular" function calls have the same effect on the heap.
The implementation, inline or not, is in the code segment anyway.
Is it possible that a.foo() gets inlined while ptr->foo() does not?
Highly unlikely. Both of these calls will probably be inlined if the implementation is visible to the compiler and the compiler decides that it would be beneficial.
I used "case 2" in my code numerous times and it was always inlined using g++.
Although it is mostly implementation specific, there is no real limitation that restricts a call through a pointer compared to a call on an on-stack object (besides the virtual functions which you already mentioned).
You should note that the produced inlined code might still be different. Case 2 will have to first determine the actual address which will have an impact on the performance, but it should be pretty much the same from there.
if/why the compiler might not inline a function call that potentially manipulates data on heap.
The compiler is free to inline a function call or not (and might decide that after devirtualization). The inlining decision is left to the compiler (so the inline keyword, like register, is often ignored when making optimization decisions). The compiler often decides whether to inline separately for every particular call (so for every occurrence of the called function name).
Suppose you are implementing a compiler, can you think of any legitimate reason why you won't optimize ptr->foo() while optimizing a.foo()?
This is really easy. Often (among other criteria) the inlining is decided according to the depth of previously inlined nested function calls, or according to the current size of the expanded internal representation. So it does happen that a particular occurrence of ptr->foo() would be inlined (e.g. because it occurs in a small function) but another occurrence of a.foo() won't be inlined.
Remember, inlining decisions are generally made at each call site. And on some compilers, the thresholds used by the compiler may vary or can be tuned.
But inlining does not always speed up execution time (because of CPU cache and branch predictor issues, and many other mysteries....), and that is yet another reason why sometimes a compiler won't inline a particular call.
For GCC compiler, read about inline functions and various optimization options (notice that -finline-limit=100 and -finline-limit=200 will give different inlining decisions; you could even play with different --params options; the MILEPOST GCC project used machine learning techniques to tune these....).
Perhaps some compilers can more easily do devirtualization for stack allocated data (I really don't know, and compilers are making progress on such issues). This is probably the reason why (perhaps!) heap vs stack allocation could influence inlining decisions.
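If you want to check this on your own compiler, here is a compilable variant of the question's code to paste into a disassembler or Compiler Explorer. Counter stands in for class A, and call_both is a name invented for this sketch:

```cpp
#include <cassert>
#include <memory>

// Stands in for class A from the question: non-virtual member function whose
// body is visible at both call sites.
class Counter {
public:
    void foo() { ++i_; }
    int value() const { return i_; }
private:
    int i_ = 0;
};

int call_both() {
    Counter a;                               // automatic (stack) storage
    auto ptr = std::make_unique<Counter>();  // dynamic (heap) storage
    a.foo();     // case 1
    ptr->foo();  // case 2: same function, reached through a pointer
    return a.value() + ptr->value();
}

// Usage check: each object was incremented once.
void check_call_both() { assert(call_both() == 2); }
```

Nothing in the language makes case 2 harder to inline than case 1; what you actually get depends on the compiler and flags, so the disassembly is the thing to trust.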

Efficiency penalty of initializing a struct/class within a loop

I've done my best to find an answer to this with no luck. Also, I've tested it and don't see any difference whatsoever in an optimized release build (there is a difference in debug)... still, I can't imagine why there is no difference, or how the optimizer is able to remove the penalty, and maybe someone knows what is happening internally.
If I create new instances of a simple class/struct within a loop, is there any penalty in efficiency for creating the class/struct on every loop iteration?
i.e.
struct mystruct
{
inline mystruct(const double &initial) : myvalue(initial) {}
double myvalue;
};
why does...
for(int i=0; i<big_int; ++i)
{
mystruct a = mystruct(1.1);
}
take the same amount of real time as
for(int i=0; i<big_int; ++i)
{
double s = 1.1;
}
?? Shouldn't there be some time required for the constructor/initialization?
This is easy-peasy work for a modern optimizer to handle.
As a programmer you might look at that constructor and struct and think it has to cost something. "The constructor code involves branching, passing arguments through registers/stack, popping from the stack, etc. The struct is a user-defined type, it must add more data somewhere. There's aliasing/indirection overhead for the const reference, etc."
Except the optimizer then has a go at your code, and it notices that the struct has no virtual functions, it has no objects that require a non-trivial constructor. The whole thing fits into a general-purpose register. And then it notices that your constructor is doing little more than assigning one variable to another. And it'll probably even notice that you're just calling it with a literal constant, which translates to a single move/store instruction to a register which doesn't even require any additional memory beyond the instruction.
It's all very magical, and compilers are sophisticated beasts, but they usually do this in multiple passes, and from your original code to intermediate representations, and from intermediate representations to machine code. To really appreciate and understand what they do, it's worth having a peek at the disassembly from time to time.
It's worth noting that C++ has been around for decades. As a successor to C, it originally was pushed mostly as an object-oriented language with hot concepts like encapsulation and information hiding. To promote a language where people start replacing public data members and manual initialization/destruction and things like that for simple accessor functions, constructors, destructors, it would have been very difficult to popularize the language if there was a measurable overhead in even a simple function call. So as magical as this all sounds, C++ optimizers have been doing this now for decades, squashing all that overhead you add to make things easier to maintain down to the same assembly as something which wouldn't be so easy to maintain.
So it's generally worth thinking of things like function calls and small structures as being basically free, since if it's worth inlining and squashing away all the overhead to zilch, optimizers will generally do it. Exceptions arise with indirect function calls: virtual methods, calls through function pointers, etc. But the code you posted is easy stuff for a modern optimizer to squash down.
C++ philosophy is that you should not "pay" (in CPU cycles or in memory bytes) for anything that you do not use. The struct in your example is nothing more than a double with a constructor tied to it. Moreover, the constructor can be inlined, bringing the overhead all the way down to zero.
If your struct had other parts to initialize, such as other fields or a table of virtual functions, there would be some overhead. The way your example is set up, however, the compiler can optimize out the constructor, producing an assembly output that boils down to a single assignment of a double.
Neither of your loops does anything. Dead code may be removed. Furthermore, there is no representational difference between a struct containing a single double and a primitive double. The compiler should be able to easily "see through" an inline constructor. C++ relies on optimisations of these things to allow its abstractions to compete with hand-written versions.
There is no reason for the performance to be different, and if it were, I would consider it a bug (except in debug builds, where debug information could change the performance cost).
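A minimal sketch of the "no representational difference" point, using a corrected copy of the struct from the question. The static_assert reflects an assumption that holds on mainstream platforms but is not guaranteed by the standard; the two sum functions are made up for the comparison and use 1.5 so the totals are exact:

```cpp
#include <cassert>

struct mystruct {
    mystruct(const double& initial) : myvalue(initial) {}
    double myvalue;
};

// Assumption: on common ABIs the wrapper adds no bytes to the double.
static_assert(sizeof(mystruct) == sizeof(double),
              "wrapper adds no storage on this platform");

// Builds a mystruct on every iteration and actually uses its value.
double sum_with_struct(int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        mystruct a = mystruct(1.5);
        total += a.myvalue;
    }
    return total;
}

// Same loop with a plain double.
double sum_with_double(int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        double s = 1.5;
        total += s;
    }
    return total;
}

// Usage check: 4 * 1.5 == 6.0 in both versions.
void check_struct_vs_double() {
    assert(sum_with_struct(4) == 6.0);
    assert(sum_with_double(4) == 6.0);
}
```

With optimizations on, you would expect both functions to compile to the same instructions; comparing their disassembly is the way to confirm that on a given compiler.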
These quotes from the C++ Standard may help to understand what optimization is permitted:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
and also:
The least requirements on a conforming implementation are:
Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
At program termination, all data written into files shall be identical to one of the possible results that execution of the program according to the abstract semantics would have produced.
The input and output dynamics of interactive devices shall take place in such a fashion that prompting output is actually delivered before a program waits for input. What constitutes an interactive device is implementation-defined.
These collectively are referred to as the observable behavior of the program.
To summarize: the compiler can generate whatever executable it likes so long as that executable performs the same I/O and access to volatile variables as the unoptimized version would. In particular, there are no requirements about timing or memory allocation.
In your code sample, the entire thing could be optimized out as it produces no observable behaviour. However, real-world compilers sometimes decide to leave in things that could be optimized out, if they think the programmer really wanted those operations to happen for some reason.
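The observable-behaviour rule can be sketched like this (spin_plain, spin_volatile and sink are names invented for the example). The first loop has no effect the abstract machine requires anyone to see, so it may vanish; the second writes to a volatile object, and those writes must take place:

```cpp
#include <cassert>

// No observable behaviour: a conforming compiler may delete the whole loop.
void spin_plain(int n) {
    for (int i = 0; i < n; ++i) {
        double s = 1.1;
        (void)s;  // silence the unused-variable warning
    }
}

// Accesses to a volatile object are part of the observable behaviour, so the
// n stores have to be performed.
volatile double sink = 0.0;

void spin_volatile(int n) {
    for (int i = 0; i < n; ++i) {
        sink = 1.1;
    }
}

// Usage check: after the volatile loop, the last store is visible.
void check_spin() {
    spin_plain(10);
    spin_volatile(10);
    assert(sink == 1.1);
}
```

This is also why timing an empty loop in a release build tells you very little: writing results to a volatile (or otherwise using them) is one simple way to keep the work from being optimized away.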
@Ike's answer is exactly what I was getting at. However, if you are curious about this question, I very much recommend reading the answers of @dasblinkenlight, @Mankarse, and @Matt McNabb and the discussions below them, which get at the details of the situation. Thanks all.

How big before a member function typically not inlined?

How large do member functions need to be before the compiler decides against inlining them?
(Assume GCC and -O2/-O3 or any other high optimization switches).
I believe the larger a function becomes, the smaller the benefit of inlining.
One of the main purposes for inlining is to avoid the function call and return overhead.
Small functions, such as getters and setters, are prime candidates.
Larger functions are usually not inlined. This is due to the ratio of data processing instructions to the size of the call and return overhead: the overhead is small compared to the content of the function. In general, removing the function call and return overhead from a large function will have a negligible impact on the program's performance.
As for the size threshold, it is compiler dependent. For GCC, you should look up the documentation or look at the decision point in the code.
The default threshold size for gcc inline is 600. You can change that with the flag -finline-limit.
Note that function size is not measured directly in bytes or instruction count but in a compiler-internal unit, which makes it pretty hard to determine whether a function will be inlined or not. Based on which functions are inlined at a given value of -finline-limit, you can estimate the inlinability of functions of other sizes at other values of the flag.
Then again, size alone doesn't guarantee inlining, and the decision may vary with other flags and the compiler version.
There is a good reason for this mess: since inlining is mostly about performance, the compiler adjusts its measures to the target architecture and function usage.
Source:
GCC Optimization Options
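If you want to see the effect of the thresholds on specific functions, one approach is to pin a couple of them with GCC's function attributes and compare the disassembly. This is only a sketch: the macro names FORCE_INLINE and NEVER_INLINE and the two functions are made up, and the attributes are GCC/Clang extensions, so the fallback branch simply drops them on other compilers.

```cpp
#include <cassert>

// always_inline and noinline are GCC/Clang function attributes; the macro
// names are hypothetical and exist only for this example.
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE inline __attribute__((always_inline))
#define NEVER_INLINE __attribute__((noinline))
#else
#define FORCE_INLINE inline
#define NEVER_INLINE
#endif

// Requested to be expanded at every call site, regardless of size heuristics.
FORCE_INLINE int add_one(int x) { return x + 1; }

// Kept as a real call, which makes it easy to find in the disassembly.
NEVER_INLINE int add_two(int x) { return add_one(add_one(x)); }

// Usage check.
void check_add_two() { assert(add_two(40) == 42); }
```

For everything not annotated this way, the compiler's own heuristics (and flags such as -finline-limit) still decide.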

Why are C++ function calls cheap?

While reading Stroustrup's "The C++ Programming Language", I came across this sentence on p. 108:
"The style of syntax analysis used is usually called recursive descent; it is a popular and straightforward top-down technique. In a language such as C++, in which function calls are relatively cheap, it is also efficient."
Can someone explain why C++ function calls are cheap? I'd be interested in a general explanation, i.e. what makes a function call cheap in any language, as well, if that's possible.
Calling C or C++ functions (in particular when they are not virtual) is quite cheap since it involves only a few machine instructions, and a jump (with link to return address) to a known location.
On some other languages (e.g. Common Lisp, when applying an unknown variadic function), it may be more complex.
Actually, you should benchmark: many recent processors are out-of-order & superscalar, so are doing "several things at a time".
However, optimizing compilers are capable of marvellous tricks.
For many functional languages, a called function is in general a closure, and needs some indirection (and also passing the closed values).
Some object oriented languages (like Smalltalk) may involve searching a dictionary of methods when invoking a selector (on an arbitrary receiver).
Interpreted languages may have a quite larger function call overhead.
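C++ itself can show both flavours of closure call, which may make the indirection point easier to picture. In this sketch (all names invented), the template version sees the exact closure type, while the std::function version hides it behind type erasure and calls through an indirection:

```cpp
#include <cassert>
#include <functional>

// The callable's type is a template parameter: the compiler knows exactly
// which operator() is called and can inline it.
template <typename F>
int apply_direct(F f, int x) { return f(x); }

// std::function erases the callable's type; the call goes through an
// indirection that is harder (though not impossible) to optimize away.
int apply_erased(const std::function<int(int)>& f, int x) { return f(x); }

int add_offset_both_ways(int offset, int x) {
    auto add = [offset](int v) { return v + offset; };  // closure capturing offset by value
    return apply_direct(add, x) + apply_erased(add, x);
}

// Usage check: (1 + 10) + (1 + 10) == 22.
void check_add_offset() { assert(add_offset_both_ways(10, 1) == 22); }
```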
Function calls are cheap in C++ compared to most other languages for one reason: C++ is built upon the concept of function inlining, whereas (for example) Java is built upon the concept of everything-is-a-virtual-function.
In C++, most of the time you're calling a function, you're not actually generating a call instruction. Especially when calling small or template functions, the compiler will most likely inline the code. In such a case the function call overhead is simply zero.
Even when the function is not inlined, the compiler can make assumptions about what the function does. For example, the Windows x64 calling convention treats the registers RAX, RCX, RDX, R8-R11 and XMM0-XMM5 as volatile: a called function may overwrite them, so the caller must save any live values held in them before the call and restore them afterwards. But if the compiler can prove that these registers are not used by the called function, such code can be omitted. This optimization is much harder when calling a virtual function.
Sometimes inlining is not possible. Common reasons include the function body not being available at compile time, or the function being too large. In that case the compiler generates a direct call instruction. However, because the call target is fixed, the CPU can prefetch the instructions quite well. Although direct function calls are fast, there is still some overhead because the caller needs to save some registers on the stack, adjust the stack pointer, etc.
Finally, when using a Java function call or a C++ function with the virtual keyword, the CPU will execute a virtual call instruction. The difference from a direct call is that the target is not fixed, but instead stored in memory. The target function may change during the runtime of the program, which means that the CPU cannot always prefetch the data at the function location. Modern CPUs and JIT compilers have various tricks up their sleeves to predict the location of the target function, but it is still not as fast as a direct call.
tldr: function calls in C++ are fast because C++ implements inlining and by default uses direct calls over virtual calls. Many other languages do not implement inlining as well as C++ does and utilize virtual functions by default.
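A small sketch of the direct versus virtual distinction (the types and functions are invented for illustration). The call through the base reference has to go through the vtable unless the compiler can devirtualize it, while the non-virtual function and the call on the final class have a target that is known at compile time:

```cpp
#include <cassert>

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;  // dispatched through the vtable
};

struct Triangle final : Shape {
    int sides() const override { return 3; }
};

// Non-virtual free function: a direct call, and an easy inlining candidate.
int twice(int x) { return 2 * x; }

// Target unknown from the static type: an indirect call unless devirtualized.
int sides_via_base(const Shape& s) { return s.sides(); }

// Triangle is final, so the compiler knows the exact target here.
int sides_direct(const Triangle& t) { return t.sides(); }

// Usage check.
void check_calls() {
    Triangle t;
    assert(sides_via_base(t) == 3);
    assert(sides_direct(t) == 3);
    assert(twice(21) == 42);
}
```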
The cost of a function call is associated with the set of operations required to go from a given scope to another, i.e., from a current execution to the scope of another function. Consider the following code:
void foo(int w) { int x, y, z; ...; }
int main() { int a, b, c; ...; foo(b); ...; }
The execution starts in main(), and you may have some variables loaded into registers/memory. When you reach foo(), the set of variables available for use is different: a, b, c values are not reachable by function foo() and, in case you run out of available registers, the values stored will have to be spilled to memory.
The issue with registers appears in any language. But some languages need more complex operations to change from scope to scope: C++ simply pushes whatever is required by the function onto the stack and keeps a pointer back to the caller's frame (in this case, while running foo(), w holds a copy of the value of b from main()'s scope).
Other languages must allocate and pass forth complex data to allow access for surrounding scope variables. These extra allocations, and even searches for specific labels within the surrounding scopes, can raise the cost of function calls considerably.