How large do member functions need to be before the compiler decides against inlining them?
(Assume GCC with -O2/-O3 or any other high optimization level.)
I believe the larger a function becomes, the smaller the benefit of inlining.
One of the main purposes for inlining is to avoid the function call and return overhead.
Small functions, such as getters and setters, are prime candidates.
Larger functions are usually not inlined. This is due to the ratio of data processing instructions to the size of the call and return overhead: the overhead is small compared to the work done inside the function. For a large function, removing the call and return overhead will generally have a negligible impact on the program's performance.
As for the size threshold, it is compiler dependent. For GCC, you should look up the documentation or look at the decision point in the code.
The default inlining threshold for GCC is 600. You can change it with the -finline-limit flag.
Note that function size is not measured in bytes or in machine instructions but in what the GCC documentation calls pseudo-instructions, an abstract measure whose exact meaning may change between releases. This makes it hard to predict whether a given function will be inlined. The practical approach is empirical: observe which functions are inlined at one value of -finline-limit and use that to estimate what happens to functions of other sizes at other values of the flag.
Size alone doesn't guarantee inlining, though, and the decision may vary with other flags and with the compiler version.
There is a good reason for this mess: since inlining is mostly about performance, the compiler adjusts its measures to the target architecture and to how the function is used.
Source:
GCC Optimization Options
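Rather than guessing from the threshold, you can ask GCC to report what it decided. The following is a minimal sketch, assuming a reasonably recent GCC. The class Counter, the function sum_to, and the file name counter.cpp are made up for illustration. The options in the comments are GCC options, but the exact report format varies between versions.

```cpp
// Possible build commands (counter.cpp is a hypothetical file name):
//   g++ -O2 -fopt-info-inline-optimized -c counter.cpp   reports calls that were inlined
//   g++ -O2 -Winline -c counter.cpp                      warns when a function declared
//                                                        inline could not be inlined
//   g++ -O2 -finline-limit=50 -c counter.cpp             lowers the limit discussed above

// Hypothetical class: member functions defined inside the class body are
// implicitly inline, and small accessors like these are typical candidates.
class Counter {
public:
    void add(int x) { total_ += x; }
    int total() const { return total_; }

private:
    int total_ = 0;
};

// Hypothetical caller: with optimization enabled, the calls to add() and
// total() are likely (not guaranteed) to be expanded in place.
int sum_to(int n) {
    Counter c;
    for (int i = 1; i <= n; ++i) {
        c.add(i);
    }
    return c.total();
}
```

Comparing the reports at a few different -finline-limit values is one way to carry out the empirical approach described above.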
I'm working on optimizing some code where most of the objects are allocated on the heap.
What I'm trying to understand is: if/why the compiler might not inline a function call that potentially manipulates data on heap.
To make things more clear, suppose you have the following code:
#include <memory>

class A
{
public:
    void foo() // non-const function
    {
        // modify data
        i++;
        ...
    }
private:
    int i;
    // can be anything here, including pointers
};

int main()
{
    A a; // allocate something on stack
    auto ptr = std::make_unique<A>(); // allocate something on heap
    a.foo(); // case 1
    ptr->foo(); // case 2
    return 0;
}
Is it possible that a.foo() gets inlined while ptr->foo() does not?
My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
Assume that there are no virtual functions.
EDIT: I guess my question is partially theoretical. Suppose you are implementing a compiler, can you think of any legitimate reason why you won't optimize ptr->foo() while optimizing a.foo()?
My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
That is not relevant. Inline function and "regular" function calls have the same effect on the heap.
The implementation, inline or not, is in the code segment anyway.
Is it possible that a.foo() gets inlined while ptr->foo() does not?
Highly unlikely. Both of these calls will probably be inlined if the implementation is visible to the compiler and the compiler decides that it would be beneficial.
I used "case 2" in my code numerous times and it was always inlined using g++.
Although it is mostly implementation specific, there is no real limitation that restricts a call through a pointer compared to a call on a stack object (besides virtual functions, which you already mentioned).
You should note that the produced inlined code might still be different. Case 2 will have to first determine the actual address which will have an impact on the performance, but it should be pretty much the same from there.
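To compare the two cases yourself, a small sketch like the one below can be compiled with g++ -O2 -S and the assembly searched for a call to A::foo. The accessor value() and the two wrapper functions are assumptions added only so the result can be checked. std::make_unique needs C++14 or later.

```cpp
#include <memory>

class A {
public:
    void foo() { ++i; }              // non-const, non-virtual member function
    int value() const { return i; }  // hypothetical accessor, added for checking

private:
    int i = 0;
};

// Case 1: the object lives on the stack.
int call_on_stack() {
    A a;
    a.foo();
    return a.value();
}

// Case 2: the object lives on the heap. The call target is the same
// non-virtual A::foo, so the same inlining criteria apply; only the way
// the object's address is obtained differs.
int call_on_heap() {
    auto ptr = std::make_unique<A>();
    ptr->foo();
    return ptr->value();
}
```

Whether either call is actually inlined still depends on the compiler, its version, and the flags used.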
if/why the compiler might not inline a function call that potentially manipulates data on heap.
The compiler is free to inline a function call or not (and might decide that after devirtualization). The decision belongs to the compiler, so the inline keyword, like register, is often ignored when it makes optimization decisions. The compiler usually decides separately for each call, that is, for every occurrence of the called function's name.
Suppose you are implementing a compiler, can you think of any legitimate reason why you won't optimize ptr->foo() while optimizing a.foo()?
This is really easy. Often, among other criteria, inlining is decided according to the depth of previously inlined nested function calls, or according to the current size of the expanded internal representation. So it can happen that a particular occurrence of ptr->foo() is inlined (e.g. because it occurs in a small function) while another occurrence of a.foo() is not.
Remember, inlining decisions are generally made at each call site. And on some compilers, the thresholds may vary or can be tuned.
But inlining does not always speed up execution time (because of CPU cache and branch predictor issues, and many other mysteries....), and that is yet another reason why sometimes a compiler won't inline a particular call.
For GCC, read about inline functions and the various optimization options (notice that -finline-limit=100 and -finline-limit=200 will give different inlining decisions; you could even play with different --param options; the MILEPOST GCC project used machine learning techniques to tune these).
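Besides tuning thresholds, GCC and Clang also accept per-function attributes that override the heuristics. This is a minimal sketch, not a recommendation. The function names are made up, and the attributes are compiler extensions rather than standard C++.

```cpp
// Ask the compiler never to expand this function in place (GCC/Clang extension).
__attribute__((noinline)) int add_noinline(int a, int b) {
    return a + b;
}

// Ask the compiler to expand this function in place regardless of its
// usual size limits (GCC/Clang extension).
inline __attribute__((always_inline)) int add_forced(int a, int b) {
    return a + b;
}

// Hypothetical caller containing one call site of each kind.
int use_both(int x) {
    return add_noinline(x, 1) + add_forced(x, 2);
}
```

Compiling this with g++ -O2 -S should show a call to add_noinline but none to add_forced, which makes it a convenient way to measure what inlining is worth at a particular call site.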
Perhaps some compilers can more easily do devirtualization for stack allocated data (I really don't know, and compilers are making progress on such issues). This is probably the reason why (perhaps!) heap vs stack allocation could influence inlining decisions.
Given a struct like:
struct CryptoKey {
std::vector<unsigned char> key;
~CryptoKey() { memset(key.data(),0,key.size()); }
};
The compiler is entitled to eliminate the call to memset because this will save time, and no program with defined behaviour can tell the difference. (Given that the variable key will cease to exist once the destructor returns.)
Nevertheless, code like this is useful in cryptographic applications, because the less time that a secret is stored in memory, the less chance an attacker has to extract it. (The memset does not provide security, but it does provide "defence in depth".)
My question is, which real compilers actually do eliminate such memset calls (obviously, with optimization turned on)?
Perhaps it is better to say that a good compiler would attempt to eliminate the memset call and a developer should not rely on differences of compiler implementation to avoid this optimisation. These compilers typically have secure alternatives that will not be optimised.
Secure version of memset
C11 introduces memset_s, one of whose characteristics is that it will not be optimised out.
Unlike memset, any call to the memset_s function shall be evaluated strictly according to the rules of the abstract machine as described in (5.1.2.3). That is, any call to the memset_s function shall assume that the memory indicated by s and n may be accessible in the future and thus must contain the values indicated by c.
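memset_s belongs to the optional Annex K of C11 and is not provided by every standard library. Where it is missing, a commonly used fallback is to write the zeros through a pointer to volatile. The sketch below shows that swapped-in technique, not memset_s itself. The helper name secure_zero is hypothetical. Strictly, the standard's guarantee is about volatile objects, so this relies on compilers treating volatile-qualified accesses as observable, which they do in practice.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: zero n bytes starting at p. Each store goes through
// a volatile-qualified lvalue, so the compiler is expected not to drop it.
inline void secure_zero(void* p, std::size_t n) {
    volatile unsigned char* vp = static_cast<volatile unsigned char*>(p);
    while (n--) {
        *vp++ = 0;
    }
}

struct CryptoKey {
    std::vector<unsigned char> key;
    // Wipe the key material before the vector releases its memory.
    ~CryptoKey() { secure_zero(key.data(), key.size()); }
};
```

This only wipes the current buffer. Earlier buffers left behind by vector reallocation, and any copies the compiler made, are not covered.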
Windows specific
On Windows there are other choices: SecureZeroMemory, or using #pragma optimize to turn off optimisation.
Common sub-expression optimisations
There is a broader issue with cryptographic safety: compilers are within their rights to copy buffers for optimisation reasons. Zeroing may not remove all copies, because the compiler may have applied optimisations that copy the heap data to the stack to eliminate common sub-expressions. So besides preventing the zeroing from being optimised out, care should be taken that the compiler isn't inserting additional copies.
The problem for optimizers here is that your memset isn't writing to a member at all. Yes, key will cease to exist, but not so key.data. That memory will be returned to std::allocator. And std::allocator will very likely read adjacent memory to determine the memory block from which key.data came. Typical implementations store such data in the header of allocated blocks, i.e. at negative offsets. It's not unlikely that the header will be updated to reflect the block is free, or to coalesce the free block with other free blocks.
This may even be inlined, so the optimizer sees one function doing a memset and then the header access. It would be unreasonable to expect that the optimizer can figure out the memset is harmless. For all it knows, the allocator may be keeping a pool of zeroed blocks.
I've done my best to find an answer to this with no luck. Also, I've tested it and don't see any difference whatsoever in an optimized release build (there is a difference in debug)... still, I can't imagine why there is no difference, or how the optimizer is able to remove the penalty, and maybe someone knows what is happening internally.
If I create new instances of a simple class/struct within a loop, is there any penalty in efficiency for creating the class/struct on every loop iteration?
i.e.
struct mystruct
{
    inline mystruct(const double &initial) : myvalue(initial) {}
    double myvalue;
};
why does...
for(int i=0; i<big_int; ++i)
{
    mystruct a = mystruct(1.1);
}
take the same amount of real time as
for(int i=0; i<big_int; ++i)
{
    double s = 1.1;
}
?? Shouldn't there be some time required for the constructor/initialization?
This is easy-peasy work for a modern optimizer to handle.
As a programmer you might look at that constructor and struct and think it has to cost something. "The constructor code involves branching, passing arguments through registers/stack, popping from the stack, etc. The struct is a user-defined type, it must add more data somewhere. There's aliasing/indirection overhead for the const reference, etc."
Except the optimizer then has a go at your code, and it notices that the struct has no virtual functions, it has no objects that require a non-trivial constructor. The whole thing fits into a general-purpose register. And then it notices that your constructor is doing little more than assigning one variable to another. And it'll probably even notice that you're just calling it with a literal constant, which translates to a single move/store instruction to a register which doesn't even require any additional memory beyond the instruction.
It's all very magical, and compilers are sophisticated beasts, but they usually do this in multiple passes, and from your original code to intermediate representations, and from intermediate representations to machine code. To really appreciate and understand what they do, it's worth having a peek at the disassembly from time to time.
It's worth noting that C++ has been around for decades. As a successor to C, it originally was pushed mostly as an object-oriented language with hot concepts like encapsulation and information hiding. To promote a language where people start replacing public data members and manual initialization/destruction and things like that for simple accessor functions, constructors, destructors, it would have been very difficult to popularize the language if there was a measurable overhead in even a simple function call. So as magical as this all sounds, C++ optimizers have been doing this now for decades, squashing all that overhead you add to make things easier to maintain down to the same assembly as something which wouldn't be so easy to maintain.
So it's generally worth thinking of things like function calls and small structures as being basically free, since if it's worth inlining and squashing away all the overhead to zilch, optimizers will generally do it. Exceptions arise with indirect function calls: virtual methods, calls through function pointers, etc. But the code you posted is easy stuff for a modern optimizer to squash down.
C++ philosophy is that you should not "pay" (in CPU cycles or in memory bytes) for anything that you do not use. The struct in your example is nothing more than a double with a constructor tied to it. Moreover, the constructor can be inlined, bringing the overhead all the way down to zero.
If your struct had other parts to initialize, such as other fields or a table of virtual functions, there would be some overhead. The way your example is set up, however, the compiler can optimize out the constructor, producing an assembly output that boils down to a single assignment of a double.
Neither of your loops does anything. Dead code may be removed. Furthermore, there is no representational difference between a struct containing a single double and a primitive double. The compiler should be able to easily "see through" an inline constructor. C++ relies on optimisations of these things to allow its abstractions to compete with hand-written versions.
There is no reason for the performance to be different, and if it were, I would consider it a bug (up to debug builds, where debug information could change the performance cost).
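The claim about representation can be checked at compile time. The sketch below reuses the struct from the question with the member name fixed. The two summing functions are hypothetical additions that give the loops a result, so they are not simply dead code. The size assertion holds on mainstream ABIs, but the standard does not require it.

```cpp
#include <type_traits>

struct mystruct {
    mystruct(const double& initial) : myvalue(initial) {}
    double myvalue;
};

// The wrapper adds no storage and no non-trivial copy or destruction work.
static_assert(sizeof(mystruct) == sizeof(double), "no extra storage expected");
static_assert(std::is_trivially_copyable<mystruct>::value, "plain bits");
static_assert(std::is_trivially_destructible<mystruct>::value, "no destructor work");

// Hypothetical loop that constructs the struct on every iteration.
double sum_structs(int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        mystruct a(1.5);
        total += a.myvalue;
    }
    return total;
}

// The same loop written with a bare double.
double sum_doubles(int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        double s = 1.5;
        total += s;
    }
    return total;
}
```

With optimization enabled, the two functions would be expected to compile to essentially the same machine code, which can be confirmed by comparing the output of g++ -O2 -S.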
These quotes from the C++ Standard may help to understand what optimization is permitted:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
and also:
The least requirements on a conforming implementation are:
Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
At program termination, all data written into files shall be identical to one of the possible results that execution of the program according to the abstract semantics would have produced.
The input and output dynamics of interactive devices shall take place in such a fashion that prompting output is actually delivered before a program waits for input. What constitutes an interactive device is implementation-defined.
These collectively are referred to as the observable behavior of the program.
To summarize: the compiler can generate whatever executable it likes so long as that executable performs the same I/O and access to volatile variables as the unoptimized version would. In particular, there are no requirements about timing or memory allocation.
In your code sample, the entire thing could be optimized out as it produces no observable behaviour. However, real-world compilers sometimes decide to leave in things that could be optimized out, if they think the programmer really wanted those operations to happen for some reason.
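One consequence for timing experiments: if you want the loop to survive optimization, its work has to become observable. Writing to a volatile object is one way to do that, since volatile accesses are part of the observable behaviour quoted above. A minimal sketch, with the names sink and store_each_iteration made up for illustration:

```cpp
// Hypothetical sink: every write to a volatile object must be performed.
volatile double sink = 0.0;

// The loop can no longer be removed as dead code, because each iteration
// performs a volatile write.
void store_each_iteration(int n) {
    for (int i = 0; i < n; ++i) {
        double s = 1.1;
        sink = s;
    }
}
```

The volatile store has a cost of its own, so this measures the loop plus the stores, not the initialization alone.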
@Ike's answer is exactly what I was getting at. However, if you are curious about this question, I very much recommend reading the answers of @dasblinkenlight, @Mankarse, and @Matt McNabb and the discussions below them, which get at the details of the situation. Thanks all.
I have very little (read no) compiler expertise, and was wondering if the following code snippet would automatically be optimized by a relatively recent (VS2008+/GCC 4.3+) compiler:
Object* objectPtr = getPtrSomehow();
if (objectPtr->getValue() == something1) // call 1
    dosomething1;
else if (objectPtr->getValue() == something2) // call N (there are a few more)
    dosomething2;
return;
where getValue() simply returns a member variable that is one of an enum. (The call has no observable effect)
My coding style would be to make one call before the "switch" and save the value to compare it against each of the somethingX's, but I was wondering if this was a moot point with today's compilers.
I was also unsure of what to google to find the answer to this myself.
Thank you,
AK
It's not moot, especially if the method is not const.
If getValue is not declared const, the call can't be optimized away, as subsequent calls could return different values.
If it is declared const, it's easier, but also not trivial for the compiler to optimize the call. It would need access to the implementation, to make sure the call doesn't have side effects. There's also the chance that it returns a different value even if marked const (modifies and returns a global).
Unless the compiler can examine the definition of getValue() while it compiles that piece of code, it can't elide the second call because it doesn't know whether that call has observable effects and whether it returns the same value the second time around.
Even if it sees the definition, it probably (this is my wild guess from having a few peeks at some compilers' internals) won't go out of its way to check that. The only chance you stand is the implementation being trivial and inlined twice, and then caught by common subexpression elimination. EDIT: Since the definition is in the header, and quite small, it's likely that this (inlining and subsequent CSE) will occur. Still, if you want to be sure, check the output of g++ -O2 -S or your compiler's equivalent.
So in summary, you shouldn't expect the optimization to occur. Then again, getValue is probably quite cheap, so it's unlikely to be worth the manual optimizations. What's an extra line compared to a couple of machine cycles? Not much, in most cases. If you're writing code where it is much, you shouldn't be asking but just checking it (disassembly/profiling).
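For reference, the manual version the question describes costs one extra line. The sketch below uses made-up types (Kind, Object) and a made-up function (dispatch), since the original snippet does not show them.

```cpp
// Hypothetical enum standing in for the question's something1/something2.
enum class Kind { First, Second, Other };

// Hypothetical class with a trivial const getter defined in the class body.
class Object {
public:
    explicit Object(Kind k) : kind_(k) {}
    Kind getValue() const { return kind_; }

private:
    Kind kind_;
};

// Call getValue() once, keep the result in a local, and branch on the local.
int dispatch(const Object* objectPtr) {
    const Kind value = objectPtr->getValue();
    switch (value) {
        case Kind::First:  return 1;  // stands in for dosomething1
        case Kind::Second: return 2;  // stands in for dosomething2
        default:           return 0;
    }
}
```

This form does not depend on the compiler proving that repeated calls return the same value, and it arguably reads better as well.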
As other answers have noted, the compiler generally cannot eliminate the second call since there may be side effects.
However, some compilers have a way of telling the compiler that the function has no side effects and that this optimization is allowed. In GCC, a function may be declared pure. For example:
int square(int) __attribute__((pure));
says that the function has “no effects except to return a value, and [the] return value depends only on the parameters and/or global variables.”
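A slightly fuller sketch of the same idea, assuming GCC or Clang. The caller sum_of_two_calls is made up. The attribute matters most when the definition lives in another translation unit, so it is placed after the declaration here only to keep the example self-contained.

```cpp
// Declaration with the GCC/Clang pure attribute: the result depends only on
// the arguments and on global memory, and the call has no other effects.
int square(int) __attribute__((pure));

// Hypothetical caller: because square is declared pure, the compiler is
// allowed (not required) to evaluate square(x) once and reuse the result.
int sum_of_two_calls(int x) {
    return square(x) + square(x);
}

// Definition; in real code this would typically be in a different source file.
int square(int x) {
    return x * x;
}
```

GCC also has a stricter const attribute for functions that do not read global memory at all. Declaring a function pure or const when it is not can lead to wrong code, because the compiler takes the promise on trust.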
You wrote:
My coding style would be to make one call before the "switch" and save the value to compare
it against each of the somethingX's, but I was wondering if this was a moot point
with today's compilers.
Yes, it's a moot point. What the compiler does is its business. Your hands will be full trying to write maintainable code without trying to micromanage a piece of software that is far better at its job than any of us will ever hope to be.
Focus on writing maintainable code and trust the compiler to carry out its task. If you later find your code is too slow, then you can worry about optimizing.
Remember the proverb:
Premature optimization is the root of all evil.
What would be the benefits of inlining different types of functions, and what issues would I need to watch out for when developing around them? I am not very good with a profiler, but in many different algorithmic applications inlining seems to increase the speed 8 times over. If you can give any pointers, that would be of great use to me.
Inline functions are oft' overused, and the consequences are significant. Inline indicates to the compiler that a function may be considered for inline expansion. If the compiler chooses to inline a function, the function is not called, but copied into place. The performance gain comes in avoiding the function call, stack frame manipulation, and the function return. The gains can be considerable.
Beware that they can increase program size. They can increase execution time by reducing the caller's locality of reference. When sizes increase, the caller's inner loop may no longer fit in the processor cache, causing unnecessary cache misses and the consequent performance hit. Inline functions also increase build times: if an inline function changes, everything that uses it must be recompiled. Some guidelines:
Avoid inlining functions until profiling indicates which functions could benefit from inline.
Consider using your compiler's option for auto-inlining after profiling both with and without auto-inlining.
Only inline functions where the function call overhead is large relative to the function's code. In other words, inlining large functions or functions that call other (possibly inlined) functions is not a good idea.
The most important pointer is that you should in almost all cases let the compiler do its thing and not worry about it.
The compiler is free to perform inline expansion of a function even if you do not declare it inline, and it is free not to perform inline expansion even if you do declare it inline. It's entirely up to the compiler, which is okay, because in most cases it knows far better than you do when a function should be expanded inline.
One of the reasons the compiler does a better job of inlining than the programmer is that the cost/benefit tradeoff is actually decided at the lowest level of machine abstraction: how many assembly instructions make up the function that you want to inline. Consider the ratio between the execution time of a typical non-branching assembly instruction and that of a function call. This ratio is predictable to the machine code generator, which is why the compiler can use that information to guide inlining.
The high-level compiler will often try to take care of another opportunity for inlining: when a function B is only called from function A and never called from elsewhere. This inlining is not done for performance reasons (assuming A and B are not small functions), but is useful in reducing linking time by reducing the total number of "functions" that need to be generated.
Added examples
An example of where the compiler performs massive inlining (with massive speedup) is in the compilation of the STL containers. The STL container classes are written to be highly generic, and as a result each "function" only performs a tiny bit of work. When inlining is disabled, for example when compiling in debug mode, the speed of STL containers drops considerably.
A second example would be when the callee function contains certain instructions that require the stack to be undisturbed between the caller and callee. This happens with SIMD instructions using intrinsics. Fortunately, the compilers are smart enough to automatically inline these callee functions because they can inspect whether SIMD assembly instructions are emitted and inline them to make sure the stack is undisturbed.
The bottom line
Unless you are familiar with low-level profiling and are good at assembly programming/optimization, it is better to let the compiler do the job. The STL is a special case in which it might make sense to enable inlining (with a switch) even in debug mode.
The main benefits of inlining a function are that you remove the calling overhead and allow the compiler to optimize across the call boundaries. Generally, the more freedom you give the optimizer, the better your program will perform.
The downside is that the function no longer exists. A debugger won't be able to tell you're inside of it, and no outside code can call it. You also can't replace its definition at run time, as the function body exists in many different locations.
Furthermore, the size of your binary is increased.
Generally, you should declare a function static if it has no external callers, rather than marking it inline. Only let a function be inlined if you're sure there are no negative side effects.
Function call overhead is pretty small. A more significant advantage of inline functions is the ability to use "by reference" variables directly without needing an extra level of pointer indirection. A function which makes heavy use of parameters passed by reference may benefit greatly if its parameters devolve to simple variables or fields.
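A small sketch of that situation, with made-up function names. Called out of line, accumulate receives two addresses and must load and store through them. Once it is inlined into sum_first, the compiler is free to keep i and total in registers.

```cpp
// Hypothetical helper taking both parameters by reference.
inline void accumulate(const int& x, int& sum) {
    sum += x;
}

// Hypothetical caller: after inlining, the references refer directly to the
// local variables i and total, so no pointer indirection is required.
int sum_first(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        accumulate(i, total);
    }
    return total;
}
```

Whether the references actually disappear depends on the compiler and the optimization level; checking the generated assembly is the only way to be sure.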