Inlining a function that manipulates data on the heap - C++

I'm working on optimizing code in which most of the objects are allocated on the heap.
What I'm trying to understand is if/why the compiler might not inline a function call that potentially manipulates data on the heap.
To make things more clear, suppose you have the following code:
#include <memory>

class A
{
public:
    void foo() // non-const member function
    {
        // modify data
        i++;
        // ...
    }

private:
    int i = 0;
    // can be anything here, including pointers
};

int main()
{
    A a;                              // allocate something on the stack
    auto ptr = std::make_unique<A>(); // allocate something on the heap
    a.foo();                          // case 1
    ptr->foo();                       // case 2
    return 0;
}
Is it possible that a.foo() gets inlined while ptr->foo() does not?
My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
Assume that there are no virtual functions.
EDIT: I guess my question is partially theoretical. Suppose you are implementing a compiler: can you think of any legitimate reason not to optimize ptr->foo() while optimizing a.foo()?

My guess is that this might be related to the fact the compiler does not have any guarantee that data on heap won't be modified by another thread. However, I don't understand if/why it can have any impact on inlining.
That is not relevant: inlined and "regular" function calls have the same effect on heap data.
The implementation, inlined or not, lives in the code segment either way.
Is it possible that a.foo() gets inlined while ptr->foo() does not?
Highly unlikely. Both of these calls will probably be inlined if the implementation is visible to the compiler and the compiler decides that doing so would be beneficial.
I used "case 2" in my code numerous times and it was always inlined using g++.
Although this is mostly implementation-specific, there is no real limitation that restricts a function call through a pointer compared to a call on a stack object (besides virtual functions, which you already mentioned).
You should note that the generated inlined code might still differ: case 2 first has to determine the actual address, which has some impact on performance, but from there on the two should be pretty much the same.
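Schematically, assuming both calls do get inlined (an illustrative sketch, not real compiler output):

// Case 1: 'a' lives at a fixed offset in main's stack frame, so the
// inlined body addresses the member directly:
//     increment the int at [stack frame + known offset]
//
// Case 2: the unique_ptr's stored pointer must be loaded first, and the
// inlined body then dereferences it:
//     load raw = ptr's stored pointer    (the extra step)
//     increment the int at [raw + offset of i]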

if/why the compiler might not inline a function call that potentially manipulates data on the heap.
The compiler is free to inline a call or not (and it might decide that only after devirtualization). The inlining decision is entirely the compiler's (so the inline keyword, like register, is often ignored when making optimization decisions), and the compiler usually makes it separately for every particular call (i.e. for every occurrence of the called function's name).
Suppose you are implementing a compiler: can you think of any legitimate reason not to optimize ptr->foo() while optimizing a.foo()?
This is really easy. Inlining is often decided (among other criteria) according to the depth of previously inlined nested calls, or according to the current size of the expanded internal representation. So it does happen that a particular occurrence of ptr->foo() gets inlined (e.g. because it occurs in a small function) while another occurrence of a.foo() does not.
Remember, inlining decisions are generally taken at each call site. And on some compilers, the thresholds used by the compiler may vary or can be tuned.
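If you need a particular function treated consistently regardless of those per-call-site heuristics, GCC and Clang offer per-function overrides (a sketch; these attributes are compiler extensions, not standard C++):

struct B
{
    // Request inlining at every call site, even when the heuristics say no:
    __attribute__((always_inline)) inline int fast(int x) { return x + 1; }

    // Forbid inlining of this function at every call site:
    __attribute__((noinline)) int slow(int x) { return x + 1; }
};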
But inlining does not always speed up execution (because of CPU cache and branch predictor effects, among many other mysteries...), and that is yet another reason why a compiler sometimes won't inline a particular call.
For the GCC compiler, read about inline functions and the various optimization options (notice that -finline-limit=100 and -finline-limit=200 will give different inlining decisions; you can even play with the various --param options; the MILEPOST GCC project used machine-learning techniques to tune them...).
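For example (assuming a reasonably recent GCC; foo.cpp is a placeholder source file), you can compare the decisions made under two different inlining budgets:

g++ -O2 -finline-limit=100 -S foo.cpp -o foo-100.s
g++ -O2 -finline-limit=200 -S foo.cpp -o foo-200.s

Diffing the two assembly listings often reveals calls that were inlined under one budget but not the other.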
Perhaps some compilers can do devirtualization more easily for stack-allocated data (I really don't know, and compilers are making progress on such issues). That is perhaps why heap vs. stack allocation could influence inlining decisions.

Related

Does it make sense to declare inline functions noexcept?

From what I can tell, the SO community is divided on whether declaring a function noexcept enables meaningful compiler optimizations that would not otherwise be possible. (I'm talking specifically about compiler optimizations, not library implementation optimizations based on move_if_noexcept.) For purposes of this question, let's assume that noexcept does make meaningful code-generation optimizations possible. With that assumption, does it make sense to declare inline functions noexcept? Assuming such functions are actually inlined, this would seem to require that compilers generate the equivalent of a try block around the code resulting from the inline function at the call site, because if an exception arises in that region, terminate must be called. Without noexcept, that try block would seem to be unnecessary.
My original interest was in whether it made sense to declare Lambda functions noexcept, given that they are implicitly inline, but then I realized that the same issues arise for any inline function, not just Lambdas.
let's assume that noexcept does make meaningful code-generation optimizations possible
OK
Assuming such functions are actually inlined, this would seem to require that compilers generate the equivalent of a try block around the code resulting from the inline function at the call site, because if an exception arises in that region
Not necessarily, because it might be that the compiler can look at the function body and see that it cannot possibly throw anything. Therefore the nominal exception-handling can be elided.
If the function is "fully" inlined (that is, if the inlined code contains no function calls) then I would expect that the compiler can fairly commonly make this determination -- but not for example in a case where there's a call to vector::push_back() and the writer of the function knows that sufficient space has been reserved but the compiler doesn't.
Be aware also that in a good implementation a try block might not actually require any code at all to be executed in the case where nothing is thrown.
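As a trivial sketch of the fully-visible, non-throwing case:

inline int add(int a, int b) noexcept
{
    return a + b; // the body visibly cannot throw, so no try-block
                  // equivalent is needed at any call site where this is
                  // inlined; the noexcept promise costs nothing here
}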
With that assumption, does it make sense to declare inline functions noexcept?
Yes, in order to get whatever the assumed optimizations of noexcept are.
It is worth noting that there was an interesting discussion in circles of power about nothrow-related issues. I highly recommend reading these:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3227.html
http://www.stroustrup.com/N3202-noexcept.pdf
Apparently, quite influential people are interested in adding some sort of automatic nothrow deduction to C++.
After some pondering, I've changed my position to almost the opposite; see below.
Consider this:
when you call a function that has noexcept on its declaration -- you benefit from this (no need to deal with unwindability, etc.)
when the compiler compiles a function that has noexcept on its definition -- performance suffers (unless the compiler can prove the function is indeed nothrow), because the compiler now needs to ensure that no exception can escape this function. You are asking it to enforce the no-exceptions promise
I.e. noexcept both hurts you and benefits you. That is not the case if the function is inlined! When it is inlined, there is no benefit from noexcept on the declaration whatsoever (declaration and definition become one thing)... unless you actually want the compiler to enforce this for safety's sake. And by safety I mean you'd rather terminate than produce a wrong result.
It should be obvious now -- there is no point declaring inline functions noexcept (keep in mind that not every inline function actually gets inlined).
Let's have a look at the different categories of functions that don't throw (you just know they don't):
non-inlined, compiler can prove it doesn't throw -- noexcept won't hurt the function body (the compiler will simply ignore the specification) and call sites will benefit from the promise
non-inlined, compiler can't prove it doesn't throw -- noexcept will hurt the function body but benefit call sites (hard to tell which effect is larger)
inlined, compiler can prove it doesn't throw -- noexcept serves no purpose
inlined, compiler can't prove it doesn't throw -- noexcept will hurt call site
As you can see, noexcept is simply a badly designed language feature. It only works if you want to enforce the no-exception promise. There is no way to use it correctly -- it can give you "safety", but not performance.
The noexcept keyword ended up being used both as a promise (on the declaration) and as enforcement (on the definition) -- not a perfect approach, I think (lol, second stab at exception specs and we still didn't get it right).
So, what to do?
declare your behavior (alas, the language has nothing to help you here)! E.g.:
void push_back(int k); // throws only if there is no unused memory
don't put noexcept on inline functions (unless the function is unlikely to be inlined, e.g. it is very large)
for non-inline functions (or functions that are unlikely to be inlined) -- make a judgment call. The larger the function gets, the smaller noexcept's negative effect becomes (comparatively) -- at some point it probably makes sense to specify it for the callers' benefit
use noexcept on the move constructor and move assignment operator (and the destructor?). It could affect them negatively, but if you don't, certain library functions (std::swap, some container operations) won't take the most efficient path (or won't provide the best exception guarantee); see the sketch after this list. Basically, any place that applies the noexcept operator to your function will (as of now) force you to use the noexcept specifier.
use noexcept if you don't trust the calls your function makes and would rather die than have it behave unexpectedly
pure virtual functions -- more often than not you don't trust the people implementing these interfaces, so it often makes sense to buy insurance (by specifying noexcept)
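A minimal sketch of the move-constructor point above (hypothetical Widget type): std::vector may move elements during reallocation only when the element's move constructor is noexcept; otherwise it copies, via std::move_if_noexcept, to preserve the strong exception guarantee.

#include <vector>

struct Widget
{
    Widget() = default;
    Widget(const Widget&) = default;
    Widget(Widget&&) noexcept = default;            // drop this noexcept and
    Widget& operator=(Widget&&) noexcept = default; // vector reallocation
};                                                  // falls back to copying

int main()
{
    std::vector<Widget> v(100);
    v.push_back(Widget{}); // the 100 existing elements are moved, not copied,
                           // only because the move constructor is noexcept
}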
Well, how else could noexcept have been designed?
I'd use two different keywords -- one for declaring the promise and another for enforcing it, e.g. noexcept and force_noexcept. In fact, enforcement isn't really required -- it can be done with try/catch + std::terminate() (yes, that will unwind the stack, but who cares, if it is followed by std::terminate()?)
I'd force the compiler to analyze all calls in a given function to determine whether it can throw. If it can, and a noexcept promise was made, a compiler error would be emitted
For code that can throw, but that you know won't, there should be a way to reassure the compiler. Something like this:
vector<int> v;
v.reserve(1);
...
nothrow { // memory pre-allocated, this won't throw
    v.push_back(10);
}
If the promise is broken (i.e. someone changed the vector code and it now provides different guarantees) -- undefined behavior.
Disclaimer: this approach could be too impractical, who knows...

Would the following code be optimized to one function call?

I have very little (read: no) compiler expertise, and was wondering whether the following code snippet would automatically be optimized by a relatively recent (VS2008+/GCC 4.3+) compiler:
Object* objectPtr = getPtrSomehow();
if (objectPtr->getValue() == something1)      // call 1
    doSomething1();
else if (objectPtr->getValue() == something2) // call N (there are a few more)
    doSomething2();
return;
where getValue() simply returns a member variable that is one of an enum. (The call has no observable side effects.)
My coding style would be to make one call before the "switch" and save the value to compare against each of the somethingX's, but I was wondering whether this is a moot point with today's compilers.
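That is, something like this (a sketch using the question's hypothetical names):

Object* objectPtr = getPtrSomehow();
const auto value = objectPtr->getValue(); // one call, result saved
if (value == something1)
    doSomething1();
else if (value == something2)
    doSomething2();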
I was also unsure of what to google to find the answer to this myself.
Thank you,
AK
It's not moot, especially if the method is non-const.
If getValue is not declared const, the call can't be optimized away, as subsequent calls could return different values.
If it is declared const, that makes it easier, but it is still not trivial for the compiler to optimize the call away. It would need access to the implementation to make sure the call has no side effects. There's also the chance that it returns a different value even though it's marked const (e.g. it modifies and returns a global).
Unless the compiler can examine the definition of getValue() while it compiles that piece of code, it can't elide the second call, because it doesn't know whether that call has observable effects or whether it returns the same value the second time around.
Even if it sees the definition, it probably (this is my wild guess from a few peeks at some compilers' internals) won't go out of its way to check. The only chance you stand is the implementation being trivial and inlined twice, and then caught by common subexpression elimination. EDIT: Since the definition is in the header, and quite small, it's likely that this (inlining and subsequent CSE) will occur. Still, if you want to be sure, check the output of g++ -O2 -S or your compiler's equivalent.
So, in summary, you shouldn't count on the optimization occurring. Then again, getValue is probably quite cheap, so it's unlikely to be worth the manual optimization. What's an extra line compared to a couple of machine cycles? Not much, in most cases. If you're writing code where it is much, you shouldn't be asking but just checking (disassembly/profiling).
As other answers have noted, the compiler generally cannot eliminate the second call since there may be side effects.
However, some compilers provide a way to tell them that a function has no side effects, so that this optimization is allowed. In GCC, a function may be declared pure. For example:
int square(int) __attribute__((pure));
says that the function has “no effects except to return a value, and [the] return value depends only on the parameters and/or global variables.”
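With that declaration in scope, the compiler is allowed to fold repeated calls with the same argument into one (a sketch):

int square(int) __attribute__((pure));

int sum_of_two_squares(int x)
{
    return square(x) + square(x); // 'pure' permits common subexpression
                                  // elimination: one call, result reused
}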
You wrote:
My coding style would be to make one call before the "switch" and save the value to compare against each of the somethingX's, but I was wondering whether this is a moot point with today's compilers.
Yes, it's a moot point. What the compiler does is its business. Your hands will be full trying to write maintainable code without also trying to micromanage a piece of software that is far better at its job than any of us will ever hope to be.
Focus on writing maintainable code and trust the compiler to carry out its task. If you later find your code is too slow, then you can worry about optimizing.
Remember the proverb:
Premature optimization is the root of all evil.

What are the caveats to using inline optimization in C++ functions?

What would be the benefits of inlining different types of functions, and what are the issues I would need to watch out for when developing around them? I am not much use with a profiler, but in many different algorithmic applications inlining seems to increase the speed eightfold. If you can give any pointers, they'd be of great use to me.
Inline functions are often overused, and the consequences are significant. inline indicates to the compiler that a function may be considered for inline expansion. If the compiler chooses to inline a function, the function is not called but copied into place. The performance gain comes from avoiding the function call, stack-frame manipulation, and the function return. The gains can be considerable.
Beware that inlined functions can increase program size. They can also increase execution time by reducing the caller's locality of reference: when sizes increase, the caller's inner loop may no longer fit in the processor cache, causing unnecessary cache misses and a consequent performance hit. Inline functions also increase build times: if inline functions change, the world must be recompiled. Some guidelines:
Avoid inlining functions until profiling indicates which functions could benefit from inlining.
Consider using your compiler's option for auto-inlining after profiling both with and without auto-inlining.
Only inline functions where the function call overhead is large relative to the function's code. In other words, inlining large functions or functions that call other (possibly inlined) functions is not a good idea.
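For instance, a typical good candidate is a tiny accessor whose body is smaller than the call sequence it replaces (a sketch):

class Point
{
public:
    int x() const { return x_; } // the body is a single load; a real call
    int y() const { return y_; } // and return would cost more than inlining
private:
    int x_ = 0;
    int y_ = 0;
};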
The most important pointer is that you should in almost all cases let the compiler do its thing and not worry about it.
The compiler is free to perform inline expansion of a function even if you do not declare it inline, and it is free not to perform inline expansion even if you do declare it inline. It's entirely up to the compiler, which is okay, because in most cases it knows far better than you do when a function should be expanded inline.
One of the reasons the compiler does a better job of inlining than the programmer is that the cost/benefit tradeoff is actually decided at the lowest level of machine abstraction: how many assembly instructions make up the function you want to inline. Consider the ratio between the execution time of a typical non-branching assembly instruction and that of a function call. This ratio is predictable to the machine-code generator, which is why the compiler can use that information to guide inlining.
The high-level compiler will often try to exploit another opportunity for inlining: when a function B is only ever called from function A and never from elsewhere. This inlining is not done for performance reasons (assuming A and B are not small functions) but is useful in reducing link time by reducing the total number of "functions" that need to be generated.
Added examples
An example of where the compiler performs massive inlining (with massive speedups) is the compilation of the STL containers. The STL container classes are written to be highly generic, and in return each "function" performs only a tiny bit of work. When inlining is disabled, for example when compiling in debug mode, the speed of STL containers drops considerably.
A second example is when the callee contains certain instructions that require the stack to be undisturbed between the caller and the callee. This happens with SIMD instructions used via intrinsics. Fortunately, compilers are smart enough to automatically inline such callees: they can inspect whether SIMD assembly instructions are emitted and inline the function to make sure the stack is undisturbed.
The bottom line
Unless you are familiar with low-level profiling and are good at assembly programming/optimization, it is better to let the compiler do the job. The STL is a special case in which it might make sense to enable inlining (with a switch) even in debug mode.
The main benefits of inlining a function are that you remove the calling overhead and allow the compiler to optimize across the call boundaries. Generally, the more freedom you give the optimizer, the better your program will perform.
The downside is that the function no longer exists as a single entity. A debugger won't be able to tell you that you're inside it, and no outside code can call it. You also can't replace its definition at run time, as the function body exists in many different locations.
Furthermore, the size of your binary is increased.
Generally, you should declare a function static if it has no external callers, rather than marking it inline. Only let a function be inlined if you're sure there are no negative side effects.
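For example (a sketch of the static approach):

// Internal linkage: no external callers are possible, so the compiler can
// inline every use and may drop the standalone definition entirely.
static int clamp_to_byte(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}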
Function-call overhead is pretty small. A more significant advantage of inline functions is the ability to use "by reference" variables directly without needing an extra level of pointer indirection. A function that makes heavy use of parameters passed by reference may benefit greatly if its parameters devolve into simple variables or fields.
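A sketch of that effect (with a hypothetical accumulate helper):

inline void accumulate(int& sum, int value) { sum += value; }

void demo()
{
    int total = 0;
    accumulate(total, 5); // after inlining this devolves to: total += 5;
                          // the reference parameter becomes the local
                          // variable itself, with no pointer indirection
}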

Zero functor construction and call overhead even with new and delete?

If I have a functor class with no state, but I create it on the heap with new, are typical compilers smart enough to optimize away the creation overhead entirely?
This question came up when I was making a bunch of stateless functors. If they're allocated on the stack, does their zero-state class body mean that the stack really isn't changed at all? It seems it must be, in case you later take the address of the functor instance.
Same for heap allocation.
In that case, functors always add a (trivial, but non-zero) overhead in their creation. But maybe compilers can see whether the address is ever used and, if not, eliminate the stack allocation. (Or can they even eliminate a heap allocation?)
But how about a functor that's created as a temporary?
#include <iostream>

struct GTfunctor
{
    inline bool operator()(int a, int b) { return a > b; }
};

int main()
{
    GTfunctor* f = new GTfunctor;
    GTfunctor g;
    std::cout << (*f)(2, 1) << std::endl;
    std::cout << g(2, 1) << std::endl;
    std::cout << GTfunctor()(2, 1) << std::endl;
    delete f;
}
So in the concrete example above, the three lines each call the same functor in three different ways. In this example, is there any efficiency difference between the ways? Or is the compiler able to optimize each line all the way down to a compute-free print statement?
Edit:
Most answers say that the compiler could never inline/eliminate the heap-allocated functor. But is this really true? Most compilers (GCC, MSVC, Intel) have link-time optimization as well, which could indeed perform this optimization. (But do they?)
are typical compilers smart enough to optimize away the creation overhead entirely?
When you're creating them on the heap, I doubt whether the compiler is allowed to. In my opinion:
Invoking new implies invoking operator new.
operator new is a non-trivial function defined in the run-time library.
The compiler isn't allowed to decide that you didn't really mean to invoke such a function, and silently not invoke it as an optimization.
When you're creating them on the stack, and not taking their address, then maybe... or maybe not: my guess is that every object has a non-zero size in order to occupy some memory, and thereby to have an identity, even when the object has no state apart from that identity.
Obviously, it depends on your compiler.
I would say
No compiler will optimize away the object on the heap. (This is because, as ChrisW says, compilers will never optimize away a call to new, which is almost certainly defined in another translation unit.)
Some compilers will optimize away a named object on the stack. I've known gcc to do this optimization quite often.
Most compilers will optimize away an unnamed object on the stack. This is one of the "standard" C++ optimizations, especially as more advanced C++ users tend to create lots of unnamed temporary variables.
Unfortunately, these are only rules of thumb. Optimizers are notoriously unpredictable; really the only way to know what your compiler is doing is to read the assembly output.
I highly doubt this type of optimization is allowed, but if your functor has no state, why would you want to initialize it on the heap? It should be just as easy to use it as a temporary.
A C++ object always has a non-zero size. The "empty base class optimization" allows an empty base class to have zero size, but that doesn't apply here.
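A quick illustration (sizes are implementation-defined; the values shown are merely typical):

#include <iostream>

struct Empty {};
struct Derived : Empty { int x; };

int main()
{
    std::cout << sizeof(Empty)   << '\n'; // typically 1: a standalone object
                                          // needs a unique address
    std::cout << sizeof(Derived) << '\n'; // typically sizeof(int): the empty
                                          // base occupies no extra storage
}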
I have not worked on any C++ optimizer, so whatever I say is speculation. I think the 2nd and 3rd calls will easily be expanded inline, with no overhead and no GTfunctor created. The functor pointer, however, is a different story. In your example it may seem simple enough, and any optimizer should be able to eliminate the heap allocation, but in a non-trivial program you may be creating the functor in one translation unit and using it in another -- or even in a different library that the compiler/linker/loader/runtime system has no source code for, where optimization is almost impossible. Given that optimizing this away is not easy, that the potential performance gain is not great, and that the case of an empty functor allocated on the heap is probably rare, I think most optimizer programmers would not put this optimization high on their to-do list.
The compiler cannot optimize out a call to new or delete. It may, however, optimize away the variable created on the stack, since it has no state.
A simple way to answer the heap question:
GTfunctor *f = new GTfunctor;
The value of f must not be null, so what should it be? Now suppose you also had:
GTfunctor* g = new GTfunctor;
The value of g must not equal the value of f, so what should each be?
Furthermore, neither f nor g may be equal to any other pointer obtained from new, unless some pointer elsewhere is somehow initialised to be equal to f or g, which (depending on the code that comes after) may involve examining what the entire rest of the program does.
Yes, if by local examination of the code the compiler can see that you never rely on any of these requirements, then it could perform a rewrite such that no heap allocation occurs. The problem is that if your code were that simple, you could probably do the rewrite yourself and end up with a more readable program anyway; e.g. your test program would look like your stack-based g example. So real programs would not benefit from such an optimisation in the compiler.
Presumably the reason you're doing this is that sometimes the functor does have data, depending on which type is chosen at runtime. So compile-time analysis cannot usefully work its magic here.
The C++ standard states that each (complete) object must have a size of at least one byte, so that it can be uniquely addressed.
Creating functors with new can lead to two problems:
The construction can generally not be optimized away: new invokes operator new, a function with complex side effects (it may throw std::bad_alloc).
Because you address the functor indirectly, the compiler may not be able to inline the function call.
Chances are good that you will see no sign of the functor at all if you create it on the stack.
Side note: the inline keyword is not necessary here. Every function defined inside a class definition is implicitly inline.
The compiler can probably figure out that operator() doesn't use any member variables and optimize it to the max. I wouldn't make any assumptions about the local or heap-allocated variables, though.
Edit: When in doubt, turn on the assembly output option on your compiler and see what it's actually doing. No sense in listening to a bunch of idiots on the web when you can see the real answer for yourself.
The answer to your question has two aspects.
Does the compiler optimize away the heap allocation? I strongly doubt it, but I'm not a standards guy, so I'd have to look it up.
Can the compiler optimize by inlining the object's operator()? Yes. As long as you don't make the call virtual, even the pointer dereference isn't actually performed.

Will the C++ compiler optimize away an unused return value?

If I have a function that returns an object, but this return value is never used by the caller, will the compiler optimize away the copy? (Possibly an always/sometimes/never answer.)
Elementary example:
ReturnValue MyClass::FunctionThatAltersMembersAndNeverFails()
{
    // Do stuff to members of MyClass that never fails
    return successfulResultObject;
}

void MyClass::DoWork()
{
    // Do some stuff
    FunctionThatAltersMembersAndNeverFails();
    // Do more stuff
}
In this case, will the ReturnValue object get copied at all? Does it even get constructed? (I know it probably depends on the compiler, but let's narrow this discussion down to the popular modern ones.)
EDIT: Let's simplify this a bit, since there doesn't seem to be a consensus in the general case. What if ReturnValue is an int, and we return 0 instead of successfulResultObject?
If the ReturnValue class has a non-trivial copy constructor, the compiler must not eliminate the call to the copy constructor - it is mandated by the language that it is invoked.
If the copy constructor is inline, the compiler might be able to inline the call, which in turn might cause the elimination of much of its code (also depending on whether FunctionThatAltersMembersAndNeverFails is inline).
They most likely will if the optimization level causes the code to be inlined. If not, the compiler would have to generate two different translations of the same code to make it work, which could open up a lot of edge-case problems.
The linker can take care of this sort of thing, even if the original caller and the callee are in different compilation units.
If you have a good reason to be concerned about the CPU load dedicated to a method call (premature optimization is the root of all evil,) you might consider the many inlining options available to you, including (gasp!) a macro.
Do you REALLY need to optimize at this level?
If the return value is an int and you return 0 (as in the edited question), then this may get optimized away.
You have to look at the underlying assembly. If the function is not inlined, the underlying assembly will execute a mov eax, 0 (or xor eax, eax) to set eax (which is usually used for integer return values) to 0. If the function is inlined, this will certainly get optimized away.
But this scenario isn't too useful if you're worried about what happens when you return objects larger than 32 bits. You'll need to refer to the answers to the unedited question, which paint a pretty good picture: if everything is inlined, most of it will be optimized out. If it is not inlined, the functions must be called even if they don't really do anything, and that includes the constructor of an object (since the compiler doesn't know whether the constructor modifies global variables or does something else weird).
I doubt most compilers could do that if the two functions were in different compilation units (i.e. different files). Maybe if they were both in the same file, they could.
There is a pretty good chance that a peephole optimizer will catch this. Many (most?) compilers implement one, so the answer is probably "yes".
As others have noted, this is not a trivial question at the AST-rewriting level.
Peephole optimizers work on a representation of the code at a level equivalent to assembly language (but before generation of the actual machine code). There is a chance to notice the load of the return value into a register followed by an overwrite with no intermediate read, and simply remove the load. This is done on a case-by-case basis.
I just tried this example on Compiler Explorer, and at -O3 the mov is not generated when the return value is not used:
https://gcc.godbolt.org/z/v5WGPr