Why are C++ function calls cheap?

While reading Stroustrup's "The C++ Programming Language", I came across this sentence on p. 108:
"The style of syntax analysis used is usually called recursive descent; it is a popular and straightforward top-down technique. In a language such as C++, in which function calls are relatively cheap, it is also efficient."
Can someone explain why C++ function calls are cheap? I'd also be interested in a general explanation of what makes a function call cheap in any language, if that's possible.

Calling C or C++ functions (in particular when they are not virtual) is quite cheap since it involves only a few machine instructions, and a jump (with link to return address) to a known location.
In some other languages (e.g. Common Lisp, when applying an unknown variadic function), a call may be more complex.
Actually, you should benchmark: many recent processors are out-of-order and superscalar, so they are doing "several things at a time".
However, optimizing compilers are capable of marvellous tricks.
In many functional languages, a called function is in general a closure, which needs some indirection (and also passing of the closed-over values).
Some object oriented languages (like Smalltalk) may involve searching a dictionary of methods when invoking a selector (on an arbitrary receiver).
Interpreted languages may have considerably larger function-call overhead.
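To make the contrast concrete in C++ terms, here is a minimal sketch (not from the answer above): a direct call to a known free function versus a call through a std::function, which carries captured state and an extra level of indirection, loosely analogous to the closure calls mentioned for functional languages.

#include <functional>
#include <iostream>

int add(int a, int b) { return a + b; }  // plain free function

int main() {
    // Direct call: the target is known at compile time, so this compiles to a
    // single call instruction (or is inlined away entirely at -O2).
    int x = add(1, 2);

    // Call through a closure: std::function stores the captured state and
    // dispatches through an indirection, which is harder to inline.
    int offset = 10;
    std::function<int(int)> f = [offset](int v) { return v + offset; };
    int y = f(3);

    std::cout << x + y << "\n";
}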

Function calls are cheap in C++ compared to most other languages for one reason: C++ is built upon the concept of function inlining, whereas (for example) Java is built upon the concept of everything-is-a-virtual-function.
In C++, most of the time you call a function, you're not actually generating a call instruction. Especially when calling small or template functions, the compiler will most likely inline the code. In that case the function call overhead is simply zero.
Even when the function is not inlined, the compiler can make assumptions about what the function does. For example, the Windows x64 calling convention specifies that registers such as RAX, RCX, RDX, R8-R11 and XMM0-XMM5 are volatile: the caller must assume the callee may clobber them, so any values it needs across the call have to be saved and restored around the call site (R12-R15 and XMM6-XMM15, by contrast, must be preserved by the callee). But if the compiler can prove that the called function does not actually touch certain registers, that save-and-restore code can be omitted. This optimization is much harder when calling a virtual function.
Sometimes inlining is not possible. Common reasons include the function body not being available at compile time, or the function being too large. In that case the compiler generates a direct call instruction. However, because the call target is fixed, the CPU can prefetch the instructions quite well. Although direct function calls are fast, there is still some overhead because the caller needs to save some registers on the stack, adjust the stack pointer, etc.
Finally, when using a Java function call or a C++ function declared virtual, the CPU executes an indirect call. The difference with a direct call is that the target is not fixed but stored in memory. The target function may change during the runtime of the program, which means the CPU cannot always prefetch the instructions at the target location. Modern CPUs and JIT compilers have various tricks up their sleeve to predict the target, but it is still not as fast as a direct call.
tl;dr: function calls in C++ are fast because C++ implements inlining and by default uses direct calls rather than virtual calls. Many other languages do not inline as well as C++ does and use virtual functions by default.
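A minimal sketch of that difference (the class names are made up for illustration): a small non-virtual member function is a direct call and is usually inlined, while a call through a reference to a base class with a virtual function goes through the vtable unless the compiler can devirtualize it.

struct Point {
    int x = 0;
    // Small non-virtual function defined in the class: the compiler can see
    // the body and will almost certainly inline the call.
    int get() const { return x; }
};

struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;  // virtual: resolved through the vtable
};

struct Square : Shape {
    int side = 2;
    int area() const override { return side * side; }
};

int use(const Point& p, const Shape& s) {
    // p.get(): direct call, typically no call instruction is emitted at all.
    // s.area(): indirect call through the vtable, unless the compiler can
    // prove the dynamic type of s and devirtualize it.
    return p.get() + s.area();
}

int main() {
    Point p;
    Square sq;
    return use(p, sq);
}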

The cost of a function call is associated with the set of operations required to go from a given scope to another, i.e., from a current execution to the scope of another function. Consider the following code:
void foo(int w) { int x, y, z; ...; }
int main() { int a, b, c; ...; foo(b); ...; }
The execution starts in main(), and you may have some variables loaded into registers/memory. When you reach foo(), the set of variables available for use is different: a, b, c values are not reachable by function foo() and, in case you run out of available registers, the values stored will have to be spilled to memory.
The issue with registers appears in any language. But some languages need more complex operations to change from scope to scope: C++ simply pushes whatever the function requires onto the memory stack, maintaining pointers to surrounding scopes (in this case, while running foo(), you'd still be able to reach w, whose value came from b in main()'s scope).
Other languages must allocate and pass along more complex data to give access to surrounding-scope variables. These extra allocations, and even lookups of specific names within the surrounding scopes, can raise the cost of function calls considerably.

Related

Is there a particular reason for calling conventions with predetermined order of registers used for passing arguments?

For example, calling int func(int a, int *b, ...) will put a and b into registers r0, r1 and so on, call the function, and return the result in r0 (remark: speaking very generally here, not about any specific processor or calling convention; also, let's assume fast-call passing of arguments in registers).
Now, wouldn't it be better if each function were compiled with its arguments passed in the registers best suited to their types (pointers in pointer/base registers, array-like data in vector registers, ...), rather than following one of a few calling-convention rules that use the more-or-less strict order of arguments given in the function prototype?
This way the code might avoid a few instructions used to shuffle arguments between registers before and after such a function call, and also often avoid reusing the same basic registers over and over. Of course, this would require some extra data for each function (presumably kept in a database or in the object files), but it would be beneficial even in the case of dynamic linking.
So, are there any strong arguments against doing something like that in the 21st century, except maybe historical reasons ?
For functions that are called from a single place, or that are at the very least static so the compiler can see every call site (assuming the compiler can prove that the address of the function is not passed around as a function pointer), this could be done.
The catch is that almost every time this can be done it is done, but in a slightly different way: the compiler can inline code, and once inlined, the calling convention is no longer needed, because there is no longer a call being made.
But let's get back to the database idea: you could argue that it has no runtime cost. When generating the code, the compiler checks the database and generates the appropriate code. This doesn't really help in any way. You still have a calling convention, but instead of having one (or a few) respected by all code, you now have a different calling convention for each function. Sure, the compiler no longer needs to put the first argument in r0, but it needs to put it in r1 for foo, in r5 for bar, etc. There is still overhead in setting up the proper registers with the proper values. Knowing which registers to restore after such a function call also becomes harder. Calling conventions specify clearly which registers are volatile (their values may be lost upon returning from a called function) and which are non-volatile (their values are preserved).
A far more useful feature is to generate the code of the called function in such a way that it uses the registers that already happen to hold those values. This can happen when inlining code.
To add to this, I believe this is what Rust does in Rust-to-Rust calls. As far as I know, the language does not have a fixed calling convention. Instead, the compiler tries to generate code based on in which registers the values for the arguments are already present. Unfortunately I can't seem to find any official docs about this, but this rust-lang discussion may be of help.
Going one step further: not all code paths are known at compile time. Think about function pointers: if I have the following code:
typedef void (*my_function_ptr_t)(int arg1);

/* Assume these handlers are defined elsewhere with the matching signature. */
void foo(int arg1);
void bar(int arg1);
void baz(int arg1);

my_function_ptr_t get_function(int value) {
    switch (value) {
        case 0: return foo;
        case 1: return bar;
        default: return baz;
    }
}

void do_some_stuff(int a, int b) {
    my_function_ptr_t handler = get_function(a);
    handler(b);
}
Under the database proposal foo, bar, and baz can have completely different calling conventions. This either means that you can't actually have working function pointers anymore, or that the database needs to be accessible at runtime, with the compiler generating code that consults it at runtime in order to call a function through a function pointer correctly. That can have serious overhead, for no actual gain.
This is by no means an exhaustive list of reasons, but the main idea is this: by having each function expect arguments to be in different registers you replace one calling convention with many calling conventions without gaining anything from it.

Inlining a function that manipulates data on heap

I'm working on optimizing code where most of the objects are allocated on the heap.
What I'm trying to understand is whether/why the compiler might not inline a function call that potentially manipulates data on the heap.
To make things more clear, suppose you have the following code:
#include <memory>

class A
{
public:
    void foo() // non-const function
    {
        // modify data
        i++;
        ...
    }
private:
    int i;
    // can be anything here, including pointers
};

int main()
{
    A a;                              // allocate something on stack
    auto ptr = std::make_unique<A>(); // allocate something on heap
    a.foo();    // case 1
    ptr->foo(); // case 2
    return 0;
}
Is it possible that a.foo() gets inlined while ptr->foo() does not?
My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
Assume that there are no virtual functions
EDIT: I guess my question is partially theoretical. Suppose you are implementing a compiler, can you think of any legitimate reason why you won't optimize ptr->foo() while optimizing a.foo()?
My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
That is not relevant. Inlined and "regular" function calls have the same effect on the heap.
The implementation, inline or not, is in the code segment anyway.
Is it possible that a.foo() gets inlined while ptr->foo() does not?
Highly unlikely. Both of these calls will probably be inlined if the implementation is visible to the compiler and the compiler decides that it would be beneficial.
I used "case 2" in my code numerous times and it was always inlined using g++.
Although it is mostly implementation specific, there is no real limitation that restricts a call through a pointer compared to a call on a stack object (besides virtual functions, which you already mentioned).
You should note that the produced inlined code might still be different. Case 2 has to determine the actual address first, which has some impact on performance, but it should be pretty much the same from there.
if/why the compiler might not inline a function call that potentially manipulates data on heap.
The compiler is free to inline a function call or not (and it might decide that after devirtualization). The inlining decision is the compiler's prerogative (so the inline keyword, like register, is often ignored when making optimization decisions). The compiler generally decides whether to inline each particular call, i.e. each occurrence of the called function's name, individually.
Suppose you are implementing a compiler, can you think of any legitimate reason why you won't optimize ptr->foo() while optimizing a.foo()?
This is really easy. Often (among other criteria), inlining is decided according to the depth of previously inlined nested function calls, or according to the current size of the expanded internal representation. So it does happen that a particular occurrence of ptr->foo() is inlined (e.g. because it occurs in a small function) while another occurrence of a.foo() is not.
Remember, inlining decisions are generally taken at each call site. And on some compilers, the thresholds used may vary or can be tuned.
But inlining does not always speed up execution time (because of CPU cache and branch-predictor issues, and many other mysteries...), and that is yet another reason why a compiler sometimes won't inline a particular call.
For the GCC compiler, read about inline functions and the various optimization options (notice that -finline-limit=100 and -finline-limit=200 give different inlining decisions; you could even play with different --param options; the MILEPOST GCC project used machine-learning techniques to tune them).
Perhaps some compilers can more easily do devirtualization for stack-allocated data (I really don't know, and compilers are making progress on such issues). This is probably the reason why (perhaps!) heap vs. stack allocation could influence inlining decisions.
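If you want to experiment with these per-call-site decisions yourself, a minimal sketch using GCC/Clang-specific attributes (not standard C++) looks like this; compiling with -O2 -S lets you inspect which calls were actually expanded:

// g++ -O2 -Winline -S example.cpp   (then inspect the generated assembly)

// Forced to be expanded at every call site (GCC/Clang extension).
__attribute__((always_inline)) inline int small_step(int v) {
    return v + 1;
}

// Forced to remain an out-of-line call, whatever the heuristics say.
__attribute__((noinline)) int big_step(int v) {
    return v * v + v;
}

int run(int v) {
    return small_step(v) + big_step(v);
}

int main() { return run(3); }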

Is there a reason some functions don't take a void*?

Many functions accept a function pointer as an argument. atexit and call_once are excellent examples. If these higher level functions accepted a void* argument, such as atexit(&myFunction, &argumentForMyFunction), then I could easily wrap any functor I pleased by passing a function pointer and a block of data to provide statefulness.
As is, there are many cases where I wish I could register a callback with arguments, but the registration function does not allow me to pass any arguments through. atexit only accepts one argument: a function taking 0 arguments. I cannot register a function to clean up after my object, I must register a function which cleans up after all objects of a class, and force my class to maintain a list of all objects needing cleanup.
I always viewed this as an oversight; there seemed to be no valid reason why you wouldn't allow a measly 4 or 8 byte pointer to be passed along, unless you were on an extremely limited microcontroller. I always assumed they simply didn't realize how important that extra argument could be until it was too late to redefine the spec. In the case of call_once, the POSIX version accepts no arguments, but the C++11 version accepts a functor (which is virtually equivalent to passing a function and an argument, only the compiler does some of the work for you).
Is there any reason why one would choose not to allow that extra argument? Is there an advantage to accepting only "void functions with 0 arguments"?
I think atexit is just a special case, because whatever function you pass to it is supposed to be called only once. Therefore whatever state it needs to do its job can just be kept in global variables. If atexit were being designed today, it would probably take a void* in order to enable you to avoid using global variables, but that wouldn't actually give it any new functionality; it would just make the code slightly cleaner in some cases.
For many APIs, though, callbacks are allowed to take additional arguments, and not allowing them to do so would be a severe design flaw. For example, pthread_create does let you pass a void*, which makes sense because otherwise you'd need a separate function for each thread, and it would be totally impossible to write a program that spawns a variable number of threads.
Quite a number of the interfaces taking function pointers without a pass-through argument simply come from a different time. However, their signatures can't be changed without breaking existing code. It is sort of a misdesign, but that's easy to say in hindsight. The overall programming style has moved on to limited use of functional techniques within generally non-functional programming languages. Also, at the time many of these interfaces were created, storing any extra data even on "normal" computers implied an observable extra cost: aside from the extra storage used, the extra argument also needs to be passed even when it isn't used. Sure, atexit() is hardly bound to be a performance bottleneck, seeing that it is called just once, but if you'd pass an extra pointer everywhere you'd surely also have one for qsort()'s comparison function.
Specifically for something like atexit() it is reasonably straightforward to use a custom global object with which the function objects to be invoked upon exit are registered: just register one function with atexit() that calls all of the functions registered with said global object (as sketched below). Also note that atexit() is only guaranteed to support 32 registered functions, although implementations may support more. It seems ill-advised to use it as a registry for per-object clean-up functions rather than for the one function that calls the object clean-up functions, as other libraries may need to register functions, too.
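A minimal sketch of that registry idea (the helper names are made up for illustration): a single function registered with atexit() runs every callback recorded in a global object, and each callback can carry its own state.

#include <cstdlib>
#include <functional>
#include <vector>

// Global registry of cleanup callbacks; std::function lets each entry carry
// its own state, which a plain atexit() handler cannot.
static std::vector<std::function<void()>>& cleanup_registry() {
    static std::vector<std::function<void()>> registry;
    return registry;
}

static void run_cleanups() {
    // Run in reverse registration order, mirroring destructor semantics.
    auto& reg = cleanup_registry();
    for (auto it = reg.rbegin(); it != reg.rend(); ++it) (*it)();
}

void register_cleanup(std::function<void()> fn) {
    static bool atexit_installed = false;
    if (!atexit_installed) {
        std::atexit(run_cleanups);  // one atexit slot covers every callback
        atexit_installed = true;
    }
    cleanup_registry().push_back(std::move(fn));
}

int main() {
    int* resource = new int(42);
    register_cleanup([resource] { delete resource; });  // stateful callback
}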
That said, I can't imagine why atexit() is particularly useful in C++, where objects are automatically destroyed upon program termination anyway. Of course, this approach assumes that all objects are somehow held, but that's normally necessary anyway in some form or other, and it is typically done using appropriate RAII objects.

What are the caveats to using inline optimization in C++ functions?

What would be the benefits of inlining different types of functions, and what are the issues I would need to watch out for when developing around them? I am not much use with a profiler, but for many different algorithmic applications it seems to increase the speed about eight times over; if you can give any pointers, that'd be of great use to me.
Inline functions are often overused, and the consequences are significant. inline indicates to the compiler that a function may be considered for inline expansion. If the compiler chooses to inline a function, the function is not called but copied into place. The performance gain comes from avoiding the function call, the stack-frame manipulation, and the function return. The gains can be considerable.
Beware that inline functions can increase program size. They can also increase execution time by reducing the caller's locality of reference: when sizes increase, the caller's inner loop may no longer fit in the processor cache, causing unnecessary cache misses and a consequent performance hit. Inline functions also increase build times: if inline functions change, the world must be recompiled. Some guidelines:
Avoid inlining functions until profiling indicates which functions could benefit from inline.
Consider using your compiler's option for auto-inlining after profiling both with and without auto-inlining.
Only inline functions where the function call overhead is large relative to the function's code. In other words, inlining large functions or functions that call other (possibly inlined) functions is not a good idea.
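As a rough illustration of that last guideline (a hypothetical class, just a sketch): a one-line accessor is the classic inlining win, while a function with a substantial body gains little and mostly bloats the callers.

class Sample {
public:
    // Good inlining candidate: the body is a single load, far cheaper than
    // the call/return sequence it would otherwise replace.
    int value() const { return value_; }

    // Poor inlining candidate: copying a large body into every caller
    // mostly increases code size for a negligible saving per call.
    int expensive_summary() const;  // large body, defined elsewhere

private:
    int value_ = 0;
};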
The most important pointer is that you should in almost all cases let the compiler do its thing and not worry about it.
The compiler is free to perform inline expansion of a function even if you do not declare it inline, and it is free not to perform inline expansion even if you do declare it inline. It's entirely up to the compiler, which is okay, because in most cases it knows far better than you do when a function should be expanded inline.
One of the reasons the compiler does a better job of inlining than the programmer is that the cost/benefit tradeoff is actually decided at the lowest level of machine abstraction: how many assembly instructions make up the function you want to inline. Consider the ratio between the execution time of a typical non-branching assembly instruction and that of a function call. This ratio is predictable to the machine-code generator, which is why the compiler can use that information to guide inlining.
The compiler will often also take care of another opportunity for inlining: when a function B is only called from function A and never called from elsewhere. This inlining is not done for performance reasons (assuming A and B are not small functions) but is useful in reducing link time by reducing the total number of "functions" that need to be generated.
Added examples
An example of where the compiler performs massive inlining (with massive speedup) is the compilation of the STL containers. The STL container classes are written to be highly generic, and in return each "function" only performs a tiny bit of work. When inlining is disabled, for example when compiling in debug mode, the speed of STL containers drops considerably.
A second example would be when the callee function contains certain instructions that require the stack to be undisturbed between the caller and callee. This happens with SIMD instructions using intrinsics. Fortunately, the compilers are smart enough to automatically inline these callee functions because they can inspect whether SIMD assembly instructions are emitted and inline them to make sure the stack is undisturbed.
The bottom line
Unless you are familiar with low-level profiling and are good at assembly programming/optimization, it is better to let the compiler do the job. The STL is a special case in which it might make sense to enable inlining (with a switch) even in debug mode.
The main benefits of inlining a function are that you remove the calling overhead and allow the compiler to optimize across the call boundaries. Generally, the more freedom you give the optimizer, the better your program will perform.
The downside is that the function no longer exists. A debugger won't be able to tell you're inside of it, and no outside code can call it. You also can't replace its definition at run time, as the function body exists in many different locations.
Furthermore, the size of your binary is increased.
Generally, you should declare a function static if it has no external callers, rather than marking it inline. Only let a function be inlined if you're sure there are no negative side effects.
Function call overhead is pretty small. A more significant advantage of inline functions is the ability to use "by reference" variables directly without needing an extra level of pointer indirection. A function that makes heavy use of parameters passed by reference may benefit greatly if its parameters devolve into simple variables or fields.
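A minimal sketch of that point: once a function taking its parameters by reference is inlined, the references can collapse into direct accesses to the caller's locals, with no address-taking or indirection left behind.

// Takes its arguments by reference, which for an out-of-line call would mean
// passing addresses and dereferencing them inside the callee.
inline void accumulate(int& total, const int& value) {
    total += value;
}

int sum_to(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        // Once inlined, 'total' and 'i' are accessed directly (likely kept in
        // registers); no pointers remain.
        accumulate(total, i);
    }
    return total;
}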

What are custom calling conventions?

What are these? And how am I affected by these as a developer?
Related:
What are the different calling conventions in C/C++ and what do each mean?
A calling convention describes how one function may call another. Parameters and state have to be passed to the other function so that it can execute and return control correctly. The way in which this is done has to be standardized and specified so that the compiler knows how to order parameters for consumption by the function being called. There are several standard calling conventions, but the most common are fastcall, stdcall, and cdecl.
Usually, the term custom calling convention is a bit of a misnomer and refers to one of two things:
A non-standard calling convention or one that isn't in widespread use (e.g. if you're building an architecture from scratch).
A special optimization that a compiler/linker can perform that uses a one-shot calling convention for the purpose of improving performance.
In the latter case, some values that would otherwise be pushed onto the stack are stored in registers instead. The compiler tries to make this decision based on how the parameters are used inside the code. For example, if a parameter is going to be used as the maximum value for a loop index, compared against the index on each iteration to see whether the loop should continue, that would be a good case for moving it into a register.
If the optimization is carried out, this typically reduces code size and improves performance.
And how am I affected by these as a developer?
From your standpoint as a developer, you probably don't care; this is an optimization that will happen automatically.
Each language, when calling a function, has a convention about which parameters will be passed in registers versus on the stack, and how return values will be returned.
Sometimes a different convention than the standard one is used and that's referred to as a custom calling convention.
This is most common when interoperating between different languages. For example, C and Pascal have different conventions about how to pass parameters. From C's point of view, the Pascal calling convention could be referred to as a custom calling convention.
I don't think you really need to care.
Normal calling conventions are things like __stdcall and __fastcall. They determine how your call signature is converted into a stack layout, who (caller or callee) is responsible for saving and restoring registers, etc. For example, __fastcall should use more registers where __stdcall would use more stack.
Custom calling conventions are optimised specifically for a particular function and the way it is used. They only happen, IIRC, for functions that are local to a particular module. That's how the compiler knows so much about how it's used, and also how you know that no external caller needs to be able to specify the convention.
Based on that, your compiler will use them automatically where appropriate, your code will run slightly faster and/or take slightly less space, but you don't really need to worry about it.
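For reference, the standard (non-custom) conventions can be requested explicitly on a declaration with Microsoft-specific keywords; a minimal sketch, mostly relevant to 32-bit x86 builds with MSVC:

// cdecl: arguments pushed right to left, the caller cleans up the stack
// (the default for C and C++ code).
int __cdecl add_cdecl(int a, int b) { return a + b; }

// stdcall: arguments pushed right to left, the callee cleans up the stack
// (used by most Win32 API functions).
int __stdcall add_stdcall(int a, int b) { return a + b; }

// fastcall: the first two integer-sized arguments are passed in ECX and EDX
// instead of on the stack.
int __fastcall add_fastcall(int a, int b) { return a + b; }

int main() {
    return add_cdecl(1, 2) + add_stdcall(3, 4) + add_fastcall(5, 6);
}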
Unless you are directly manipulating the stack or writing inline assembly that references local variables, it does not affect you. It can also matter if you're interfacing with libraries linked with different calling conventions.
What it is: most compilers use a standard calling convention such as cdecl, where function arguments are pushed onto the stack in a certain order, and so on.