Why aren't all functions inlined? - c++

If the implementation of an inlined function is placed everywhere that the function is called in the code, and this saves two branch steps, shouldn't a programmer try to inline every function if they do not have to worry about space?
To be more specific, I would think that executing the function body immediately would always be faster than branching to the function body, executing the function body, and branching back to where the function call was made.

shouldn't a programmer try to inline every function if they do not have to worry about space?
Yes. But in most real-world applications, you do have to worry about space. Programs and data which take up less space are (generally speaking) faster. Read about cache misses. Of course, programs which execute fewer instructions are also (generally speaking) faster, which is why we inline. These two ideas are in direct conflict, and so a balance must be struck. It is usually best to leave this balancing act up to the compiler.

No. One of the benefits of functions is to make code reusable. If a programmer manually inlines all of their functions, then the code base grows and so does the maintenance burden. If the compiler chooses to inline the functions later for the sake of speed or efficiency, this doesn't affect maintenance and retains the readability of the original code.

Adding to the above comments: inline function bodies are expanded at each call site when the program is compiled, so if everything is inlined the compiler has far more code to process, which is bad for the programmer as well (longer builds).

Related

C++ Is there any difference in performance by calling a func with code instead of calling the code directly? (From python)

I come from Python and am still new to C++.
Now I wonder if calling a function is slower than running the code of the function directly?
Some example.
struct mynum {
public:
    int m_value = 0;
    // const so the getter can also be called on const objects
    constexpr int value() const { return m_value; }
    // Say we would create a func here.
    // That wants to use the value of "m_value"
    // Is it slower to use "value()" instead of "m_value"?
    // Even if the difference is very small.
    // Or is there indeed no difference because everything gets compiled.
    void somefunc() {
        if (value() == 0) {}
    }
};
If the function body is available at the point where it is called, there is a good chance the compiler will either inline it automatically (the "inline" keyword is just a hint) or leave it as a function call. In both cases you are probably on the best path, as compilers are pretty good at these kinds of decisions, or at least better than us.
If only the function prototype (the declaration) is known by the compiler and the body is defined in another compilation unit (*.cpp file) then there are a couple of hits you might take:
The processor pipeline (and speculative execution) might stall, which may cost you a few cycles, although processors have become extremely efficient at these things in the past 10 years or so. Even dynamic branch prediction has become so good that there is no point rearranging the order of if/else like we used to do 20 years ago (still necessary for microcontrollers though).
Register allocation suffers a clean cut at the call, which mostly affects calculation-heavy code. Basically, the compiler runs an optimization to decide which registers the variables in use will live in. When you make a call, only a couple of them are guaranteed to be preserved; all the others need to be reloaded when the function returns. If the number of live variables is large, that spill/reload can affect performance, but that is really rare.
If the function is a virtual method, the indirect lookup in the virtual table might add up to ten cycles. Compilers might devirtualize a call if they know exactly which class will be called, however, so this cost might actually be the same as that of a normal function. In more complex cases, with several layers of polymorphism, virtual calls might take up to 20 cycles. In my tests with 2 layers the cost is on average 5-7 cycles on an AMD Zen3 (Threadripper).
But overall, if the function call is not virtual, the cost will be really negligible. There are programmers who swear by inlining everything, but if my experience is worth anything: I have programmatically generated code that was 100% inlined and compared it with the same code compiled separately, and the performance was largely the same.
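To make the devirtualization point above concrete, here is a minimal sketch. The types Shape and Square and the helper functions are hypothetical names invented for illustration. Marking a class final is one way to give the compiler enough information to resolve a virtual call statically; whether it actually does so depends on the compiler and the optimization level.

```cpp
#include <cassert>

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

// 'final' means no further override can exist, so a call made through
// a Square reference may be resolved at compile time.
struct Square final : Shape {
    int sides() const override { return 4; }
};

// Dynamic type unknown here: typically an indirect call via the vtable.
int sides_dynamic(const Shape& s) { return s.sides(); }

// Static type is final: the compiler is allowed to call (or inline)
// Square::sides directly.
int sides_static(const Square& s) { return s.sides(); }

// Both paths must agree on the result; only the call mechanism differs.
void demo_devirtualization() {
    Square sq;
    assert(sides_dynamic(sq) == sides_static(sq));
}
```

Comparing the assembly of sides_dynamic and sides_static at -O2 is a quick way to see whether your compiler took the hint.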
There is some function call overhead in C++, but a simple function like this that just returns a known variable will probably be compiled out and replaced with a reference to that variable.
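As a rough illustration of why a trivial getter tends to cost nothing, the sketch below uses a hypothetical MyNum type modelled on the struct in the question. Because the getter is defined inside the class, its body is visible at every call site, and it can even be evaluated at compile time. Whether a given build really emits identical code for both variants is best confirmed by looking at the assembly.

```cpp
#include <cassert>

// Hypothetical stand-in for the struct in the question.
struct MyNum {
    int m_value = 0;

    // Defined in the class body, so it is implicitly inline.
    constexpr int value() const { return m_value; }

    bool is_zero_via_getter() const { return value() == 0; }
    bool is_zero_direct() const { return m_value == 0; }
};

// The getter can be evaluated entirely at compile time.
static_assert(MyNum{}.value() == 0, "getter folds to a constant");

// At run time both variants must give the same answer.
void demo_getter() {
    MyNum n;
    assert(n.is_zero_via_getter() == n.is_zero_direct());
    n.m_value = 5;
    assert(n.is_zero_via_getter() == n.is_zero_direct());
}
```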

Using inline for a void function that takes no parameters

Is there any potential performance gain in replacing
void foo(void){/*some statement*/}
with
inline void foo(void){/*some statement*/}
When the function is inlined, there is no function call and return overhead. It will probably prevent the instruction cache from reloading. The compiler is also allowed to perform far more optimizations once the function body has been inlined.
Some more explanations:
The CPU loads instructions when it sees that it needs them, so if the function is inlined the CPU can fetch the whole code in one read, causing fewer stalls. But if the function is quite large and is not executed very often, inlining it might actually cause more harm, because the CPU will likely load more cache lines than necessary. Below is an example:
if ( condition ) {
// do some logic here
}
else {
foo();
}
Now, if condition is mostly true, then it's better if foo() is not inlined; if condition is mostly false, then it's better if it is inlined. So to make your code more cache friendly, you should find the most common path of execution and make it work with as few ifs and as few function calls as possible.
Function call overhead in this case is caused by the need to save registers on the stack (how many depends on the actual code), adjusting the stack pointer, and jumping to the function code. After the function is done, the CPU needs to restore the stack and registers to their previous state. This is a lot of work, especially if the function is called inside a tight loop.
Finally, it's important to remember that inline is only a hint to the compiler. As a programmer you have knowledge of how your code is executed, and you should use this knowledge to structure your code to make it more cache friendly.
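One way to act on the advice above is to keep the rarely taken branch out of line so the common path stays compact. The sketch below assumes GCC or Clang, whose noinline and cold function attributes exist for this purpose; the RARE_PATH macro and both function names are made up for illustration, and on other compilers the macro expands to nothing.

```cpp
#include <cassert>

#if defined(__GNUC__) || defined(__clang__)
#  define RARE_PATH __attribute__((noinline, cold))
#else
#  define RARE_PATH  // no-op on other compilers
#endif

// Rarely executed: kept as a real call so its body does not take up
// space in the instruction stream of the hot path.
RARE_PATH int handle_rare_case(int x) {
    return x * 31 + 7;
}

// Common case is handled inline; only negative inputs pay for a call.
inline int process(int x) {
    if (x >= 0) {
        return x + 1;
    }
    return handle_rare_case(x);
}

void demo_rare_path() {
    assert(process(1) == 2);
    assert(process(-1) == handle_rare_case(-1));
}
```

This is a hint about code layout, not a guarantee of speed, so it is worth measuring before and after.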
Yes (apart from the small but non-zero cost of the call): the optimiser might be able to do more to optimise the inlined code than it could with a function call.
A non-inlined function call incurs some minor penalty even if you do not pass any arguments back and forth. For example, the address where the function should return after finishing its execution needs to be stored somewhere.
If I were you, I would make no particular distinction between void-void functions and other types of functions when deciding whether to inline them. There are other, more significant aspects which decide whether inline is helpful, for example:
frequency of the calls (higher number favors inlining)
size of the function (bigger size favors not inlining)
inline is completely orthogonal to parameter and return type.
The overhead of a function call -- creating a stack frame for the function, pushing the arguments onto the stack, managing the return value, and tearing the stack frame down -- is avoided when the function is inlined. That would improve performance.
In your case, you don't have any input arguments or a return value, so the cost of a function call is reduced a little bit. It will still be more expensive than an inlined function.

How expensive are NULL pointer arguments?

In implementing a menu on an embedded system in C(++) (AVR-GCC), I ended up with void function pointers that take arguments and usually make use of them.
// void function prototype
void (*auxFunc)(char *);
In some cases (in fact quite a few), the function actually doesn't need the argument, so I would do something like:
if (something) doAuxFunc(NULL);
I know I could just overload to a different function type, but I'm actually trying not to do this as I am instantiating multiple objects and want to keep them light.
Is calling multiple functions with NULL pointers (when they are intended for an actual pointer) worse than implementing many more function prototypes?
Checking for NULLs is a very small overhead even on a microcontroller - comparison against 0 is supposed to be lightning fast. If you overload several functions, you'll crucify readability for (a very slight) improvement in performance. Just let GCC's optimizer do its stuff, it's pretty good at it :)
Look at the disassembly. It should be generating a null (zero) to pass as the first argument, which burns either a register or a stack location. If it burns a register, it may cost you a push and a pop if the calling function is starved for registers. (Just using a function call may cost you pushes and pops if the function is starved for registers, in order to implement the calling convention.)
So there is likely a cost, but it may not be enough of a cost to change the way you do things.
Checking for 0 is really cheap; overloading is even cheaper, since which function to choose is decided at compile time.
But if you think that your interfaces get too complicated with overloading and your function is small, you should declare it inline and put it in a header. Checking for 0 can then easily be optimized away by any decent modern compiler.
I think the "tradeoff" is ridiculously low for each approach but this is the time to do benchmarks for yourself. If you do so, please post some results :)
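For reference, here is a minimal sketch of the single-signature approach discussed in these answers. MenuItem, activate, showLabel and beep are hypothetical names, not taken from the question's code. Every handler shares the void (*)(char *) shape; handlers that need the argument check it for null, and the others simply ignore it.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

using AuxFunc = void (*)(char*);  // same shape as the prototype in the question

static size_t g_last_len = 0;  // state recorded by showLabel
static int g_beeps = 0;        // state recorded by beep

// Uses its argument, so it guards against a null pointer.
void showLabel(char* text) {
    g_last_len = (text != nullptr) ? strlen(text) : 0;
}

// Ignores its argument entirely, so no null check is needed.
void beep(char* /*unused*/) { ++g_beeps; }

struct MenuItem {
    AuxFunc action;  // one pointer per item keeps the object small
};

// Callers that have nothing to pass rely on the nullptr default.
void activate(const MenuItem& item, char* arg = nullptr) {
    if (item.action != nullptr) {
        item.action(arg);
    }
}

void demo_menu() {
    char label[] = "Start";
    activate(MenuItem{showLabel}, label);
    assert(g_last_len == 5);
    int before = g_beeps;
    activate(MenuItem{beep});  // nullptr is passed and ignored
    assert(g_beeps == before + 1);
}
```

The extra cost per call is one zeroed argument register and, in handlers that use the argument, one comparison against zero.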

performance hit from lots of member functions in a C++ class?

Is there a significant performance hit if I keep adding member functions to a class? When I use the class, I may only use a couple of the member functions together at once so I could in theory split the single class into a number of smaller classes with fewer member functions. Do I take a big performance hit by cramming a lot of functions into the same class?
No, it doesn't matter.
But the presence of a lot of public API indicates that you should make sure you are following the Single Responsibility Principle if you are trying to have a good design.
There should not be more than one reason for a class to change.
If your design adheres to that, then it's all good, and cramming a lot of functions into the class is not going to do any harm in terms of performance.
Some people might argue about a performance hit if those functions are virtual. However, as long as the purpose of those functions is to be overridden in derived classes, you should make them virtual. That is, if the functions are being made virtual not just for the sake of flexibility but on the basis of a well-thought-out design, then go ahead and make them virtual.
Performance hits shouldn't be a concern; they are just the price you pay for functionality you want to have.
Also, only profiling can give you accurate indications of performance bottlenecks. Without it, what you get is speculation or guesses based on experience, which might not always be truly indicative.
The performance hit comes not from having lots of functions but from having a deep hierarchy of function calls to do a given task. Every function call results in
Pushing/saving the current base pointer.
Saving the current stack pointer and then making it the new base pointer.
Pushing the parameters on the stack.
Executing the function.
So, e.g., if in the sequence of execution you end up calling 10 functions, you wind up the stack 10 times, and when the function(s) finish the stack has to be unwound.
Secondly, there are GCC optimizations to reduce the cost of a jump to your function in the .text section, e.g. by using an attribute called hot, which improves the locality of such functions so that access to them is faster.
You always have to think about a function's execution in the context of the thread executing it, so that you can identify the various bottlenecks and optimize them.
No, there shouldn't be, and there wouldn't be any difference in binary size if you broke up the class into smaller pieces. If you want to reduce binary size to only the functions you call, you can make your class a template.
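The template suggestion relies on the fact that member functions of a class template are only instantiated when they are actually used. The sketch below demonstrates that rule with a hypothetical Toolbox type; how much this shrinks a real binary also depends on the compiler and linker settings, so treat it as an illustration rather than a measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

template <typename T>
struct Toolbox {
    T value;

    T twice() const { return value + value; }

    // Would fail to compile for T = int if it were instantiated, since
    // int has no size(). It is harmless as long as nobody calls it on
    // such a T, which shows that unused members are never generated.
    std::size_t length() const { return value.size(); }
};

void demo_toolbox() {
    Toolbox<int> n{21};
    assert(n.twice() == 42);  // only twice() is instantiated for int

    Toolbox<std::string> s{"abc"};
    assert(s.length() == 3);  // length() is instantiated for std::string
}
```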

Overhead of calling tiny functions from a tight inner loop? [C++]

Say you see a loop like this one:
for(int i=0;
i<thing.getParent().getObjectModel().getElements(SOME_TYPE).count();
++i)
{
thing.getData().insert(
thing.GetData().Count(),
thing.getParent().getObjectModel().getElements(SOME_TYPE)[i].getName()
);
}
If this were Java I'd probably not think twice. But in performance-critical sections of C++, it makes me want to tinker with it... however, I don't know if the compiler is smart enough to make that futile.
This is a made up example but all it's doing is inserting strings into a container. Please don't assume any of these are STL types, think in general terms about the following:
Is having a messy condition in the for loop going to get evaluated each time, or only once?
If those get methods are simply returning references to member variables on the objects, will they be inlined away?
Would you expect custom [] operators to get optimized at all?
In other words is it worth the time (in performance only, not readability) to convert it to something like:
ElementContainer &source =
thing.getParent().getObjectModel().getElements(SOME_TYPE);
int num = source.count();
Store &destination = thing.getData();
for(int i=0;i<num;++i)
{
destination.insert(thing.GetData().Count(), source[i].getName());
}
Remember, this is a tight loop, called millions of times a second. What I wonder is if all this will shave a couple of cycles per loop or something more substantial?
Yes I know the quote about "premature optimisation". And I know that profiling is important. But this is a more general question about modern compilers, Visual Studio in particular.
The general way to answer such questions is to look at the produced assembly. With gcc, this involves replacing the -c flag with -S.
My own rule is not to fight the compiler. If something is to be inlined, then I make sure that the compiler has all the information needed to perform such an inline, and (possibly) I try to urge it to do so with an explicit inline keyword.
Also, inlining saves a few opcodes but makes the code grow, which, as far as L1 cache is concerned, can be very bad for performance.
All the questions you are asking are compiler-specific, so the only sensible answer is "it depends". If it is important to you, you should (as always) look at the code the compiler is emitting and do some timing experiments. Make sure your code is compiled with all optimisations turned on - this can make a big difference for things like operator[](), which is often implemented as an inline function, but which won't be inlined (in GCC at least) unless you turn on optimisation.
If the loop is that critical, I can only suggest that you look at the code generated. If the compiler is allowed to aggressively optimise the calls away, then perhaps it will not be an issue. Sorry to say this, but modern compilers can optimise incredibly well, and I really would suggest profiling to find the best solution in your particular case.
If the methods are small and can and will be inlined, then the compiler may do the same optimizations that you have done. So, look at the generated code and compare.
Edit: It is also important to mark const methods as const, e.g. in your example count() and getName() should be const to let the compiler know that these methods do not alter the contents of the given object.
As a rule, you should not have all that garbage in your "for condition" unless the result is going to be changing during your loop execution.
Use another variable set outside the loop. This will eliminate the WTF when reading the code, it will not negatively impact performance, and it will sidestep the question of how well the functions get optimized. If those calls are not optimized this will also result in performance increase.
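Here is a small sketch of that hoisting, under the assumption of simple stand-in types. Element and Model are hypothetical, and std::vector is used only to keep the example self-contained, even though the question asks not to assume STL containers. Both functions produce the same result; the second evaluates the accessor chain once instead of on every iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Element {
    std::string name;
    const std::string& getName() const { return name; }
};

struct Model {
    std::vector<Element> elements;
    const std::vector<Element>& getElements() const { return elements; }
};

// Chained style: the accessor calls appear in the loop condition and
// body, so they run every iteration unless the optimiser removes them.
std::vector<std::string> collect_chained(const Model& model) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < model.getElements().size(); ++i) {
        out.push_back(model.getElements()[i].getName());
    }
    return out;
}

// Hoisted style: look up the container and its size once, up front.
std::vector<std::string> collect_hoisted(const Model& model) {
    const std::vector<Element>& source = model.getElements();
    const std::size_t count = source.size();
    std::vector<std::string> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        out.push_back(source[i].getName());
    }
    return out;
}

void demo_hoisting() {
    Model m;
    m.elements.push_back(Element{"a"});
    m.elements.push_back(Element{"b"});
    assert(collect_chained(m) == collect_hoisted(m));
}
```

Note that the accessors are marked const, which also matches the advice about const methods given in another answer.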
I think in this case you are asking the compiler to do more than it legitimately can given the scope of compile-time information it has access to. So, in particular cases the messy condition may be optimized away, but really, the compiler has no particularly good way to know what kind of side effects you might have from that long chain of function calls. I would assume that breaking out the test would be faster unless I have benchmarking (or disassembly) that shows otherwise.
This is one of the cases where a JIT compiler has a big advantage over a C++ compiler. It can in principle optimize for the most common case seen at runtime and emit optimized machine code for that (plus checks to make sure that execution really falls into that case). This sort of thing is used all the time for polymorphic method calls that turn out not to actually be used polymorphically; whether it could catch something as complex as your example, though, I'm not certain.
For what it's worth, if speed really mattered, I'd split it up in Java too.