I have two functions void f(int x){...} and void g(int x){f(x);}. I know 99% of the time g() receives 3 or 5. In f(), x never changes and supervises lots of loops and condition branches. Would the following be faster than my original code ?
void g(int x)
{
if(x == 3) f(3);
else if(x == 5) f(5);
else f(x);
}
Would the compiler (g++ -Ofast) compile f(3) and f(5) separately from f(x), analogous to compiling two template parameters? What else should I do to let the compiler recognize the optimization opportunity more easily? Is declaring void f(const int &x){...} helpful or necessary?
Answers to such questions are ultimately misleading, because they depend not only on the exact environment you use, but also on the other code your project links with, should link-time optimization be used. Furthermore, the compiler can generate multiple versions of a function, some more “optimal” than others, and which one gets used then depends on who’s calling g(). If g() can be constexpr, make it so: the compiler could use that fact to guide optimizations.
In any case, you need to look at the output of your compiler, with the code as it is compiled into your project. Only then can you tell. As a first step, you should head to Compiler Explorer at https://godbolt.org and see for yourself in an isolated environment.
If this is a performance critical function, and 99% of the time f(3) or f(5) is called, and you are trying to optimize, you should measure the difference of such calls. If f() is an inline function, the optimizer may be able to work with your constant better than with a variable, and evaluate some of its functionality at compile time (through constant folding, strength reduction, etc.). It might be useful to use Godbolt.org to look at the assembly and see if any obvious improvements occur. LTO may also help even if it's not inlined, though different people report different levels of success with this.
If you don't see much improvement but think there could be some from knowing x in advance, you could also consider writing different specialized versions of f(), such as f3() and f5(), which are optimized for those cases (though you might also end up with a larger instruction footprint and have icache issues). It all comes down to measuring what you try and seeing where the benefits (and losses) are. The most important thing is to measure. It's no fun making code complicated for no gain (or worse, slowing it down in the name of optimization).
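A minimal sketch of the "specialized versions" idea, using a template parameter instead of hand-written f3() and f5(). The bodies, the return type and the extra parameter n are hypothetical stand-ins (the question's f() returns void and its body is not shown); the only point is that x is a compile-time constant inside f_fixed<3> and f_fixed<5>.

```cpp
#include <cassert>

// Hypothetical stand-in for f(): x steers a loop and a branch.
// As a template parameter, X is a compile-time constant in each instantiation.
template <int X>
long f_fixed(int n) {
    long acc = 0;
    for (int i = 0; i < n; ++i) {
        if (i % X == 0) acc += X;   // divisor is a known constant here
        else            acc += i;
    }
    return acc;
}

// Generic fallback for the rare values of x (same logic, runtime x).
long f_any(int x, int n) {
    assert(x != 0);                 // i % 0 would be undefined
    long acc = 0;
    for (int i = 0; i < n; ++i) {
        if (i % x == 0) acc += x;
        else            acc += i;
    }
    return acc;
}

// Dispatch on the two common values, as in the question's g().
long g(int x, int n) {
    if (x == 3) return f_fixed<3>(n);
    if (x == 5) return f_fixed<5>(n);
    return f_any(x, n);
}
```

Calling g(3, n) runs the instantiation for 3, and any other value falls back to f_any(). Whether this beats the plain call is exactly what has to be measured, and the duplicated body is the code-size cost mentioned above.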
Related
Is something like this code considered a bad practice?
If so, what should I do when func1 duplicates func2's behavior and I need both functions to be present (isn't that considered code redundancy)?!
UPD: Sorry for my bad illustration, I'll try to explain the question more clearly.
What I wanted to ask is this:
I'm trying to design an optimized class that heavily calls two methods, func1 and func2. func1's implementation uses func2, and I want the two method calls to be inlined as much as possible. So is it better to call func2 from func1 as in this code, or to implement both independently?
inline int func2(int x) {
return x * (x + 2);
}
inline int func1(int x) {
return x * (x + 1) * func2(x + 2);
}
Writing several small functions is fine if it avoids writing the same code more than once. Some may argue that too many small functions make code hard to read, but that's a matter of opinion.
If you are worried about performance, the compiler will inline if it thinks it will help. You shouldn't worry about it until you've proven that there is a problem. See this question on premature optimization.
There's no problem in a function calling another function. You'll see real programs go much deeper than 2 calls, if you sample them.
As far as inlining, that's also no problem. An optimizing compiler would typically inline func2 (assuming its definition is visible and optimizations are enabled). Many common compilers and optimizers are smart about inlining. They often know when to inline and when not to inline -- all without your assistance.
Writing small functions is not a bad practice. Clarity and intent are typically of a higher importance than micro-optimizations. Under typical circumstances, there's nothing wrong with your example.
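As a small sketch of why the nested call costs nothing in principle: the version below marks the question's two functions constexpr (an addition for this sketch, not part of the original code) and checks at compile time that func1 equals a hand-expanded expression. func1_expanded is a hypothetical name used only for the comparison.

```cpp
#include <cassert>

// The question's functions; constexpr is added for this sketch
// so the compiler can evaluate them at compile time.
constexpr int func2(int x) { return x * (x + 2); }
constexpr int func1(int x) { return x * (x + 1) * func2(x + 2); }

// Hand-expanded equivalent: func2(x + 2) == (x + 2) * (x + 4).
constexpr int func1_expanded(int x) {
    return x * (x + 1) * (x + 2) * (x + 4);
}

// Both forms agree, and the check itself happens during compilation.
static_assert(func1(1) == func1_expanded(1), "forms must agree");
static_assert(func1(3) == 3 * 4 * 5 * 7, "func1(3) is 420");
```

The compiler sees straight through the call to func2 here, which is the same visibility an optimizer needs in order to inline it in ordinary runtime code.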
If it helps readability of your code, then yes. You should almost always aim for readability of your code. Don't forget to name your functions properly so other people will easily understand what each function is doing. And by other people I mean you too, in a few weeks or months. As they say, you write code once but read it many times.
As for the performance, modern compilers know when to inline a function and you should not worry about it. In cases where it really matters you will just use the profiler to find the hotspot and eventually change it. But that will happen far less often than you think. You will almost always find better ways to optimize your code.
If both are implemented in the same scope, then the compiler can even do some algebraic optimization without inlining. Some time ago I was very surprised to see that the compiler sometimes replaces big and complex structures with simple calls to the destination functions (a kind of argument forwarding for the d3d API). So if you worry about performance, just don't, at least unless your app's benchmarks are really bad.
On the other hand, it's all about relationships: if func1 is not really logically related to func2 and only the code or math is similar, then it is better to copy func2's code into func1. Why? Because func2 may be changed while you forget about func1 and break it, since they were never related by internal domain logic.
UPD after UPD
If it is all about speed and there is only math, then write a fully optimized expression in func1 and don't rely on the compiler. But do that only if you really know that performance comes first.
Out of habit I often write function definitions inline for simple functions such as this (contrived example)
class PositiveInteger
{
private:
long long unsigned m_i;
public:
PositiveInteger (int i);
};
inline PositiveInteger :: PositiveInteger (int i)
: m_i (i)
{
if (i < 0)
throw "oops";
}
I generally like to separate interface files and implementation files but, nevertheless, this is my habit for those functions which the voice in my head tells me will probably be hit a lot in hot spots.
I know the advice is "profile first" and I agree but I could avoid a whole load of profiling effort if I knew a priori that the compiler would produce identical final object code whether functions like this were inlined at compilation or link time. (Also, I believe the injected profiling code itself can cause a change in timing which swamps the effect of very simple functions such as the one above.)
GCC 5.1 has just been released advertising LTO (link time optimization) improvements. How good are they really? What kinds of functions can I safely un-inline knowing the final executable will not be affected?
You already answered your own question: Unless you're targeting an embedded system of some sort with restricted resources, write the code for clarity and maintainability first. Then if performance isn't acceptable you can profile and target your efforts towards the actual hotspots. Think about it: If you write clearer code that takes an extra 250ns that's not noticeable in your use case then the extra time doesn't matter.
GCC with LTO does cross-module inlining, so most of the time you should not see a difference in code quality. Out-of-line function definitions are also not duplicated across translation units, so they compile faster and produce smaller object files.
However, GCC's inlining heuristics treat the inline keyword as a hint that the function is probably a good candidate for inlining, and they increase the limits on function size accordingly. Similarly, a function gets a small extra boost when it is defined in the same translation unit as its caller. For small functions like the one in your example, however, this should not make any difference.
It is my understanding that modern c++ compilers take shortcuts on things like:
if(true)
{do stuff}
But how about something like:
bool foo(){return true;}
...
if(foo())
{do stuff}
Or:
class Functor
{
public:
bool operator() () { return true;}
};
...
Functor f;
if(f()){do stuff}
It depends if the compiler can see foo() in the same compilation unit.
With optimization enabled, if foo() is in the same compilation unit as the callers, it will probably inline the call to foo() and then optimization is simplified to the same if (true) check as before.
If you move foo() to a separate compilation unit, the inlining can no longer happen, so most compilers will no longer be able to optimize this code. (Link-time optimization can optimize across compilation units, but it's a lot less common--not all compilers support it and in general it's less effective.)
I've just tried g++ 4.7.2 with -O3, and in both examples it optimizes out the call. Without -O, it doesn't.
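For anyone who wants to repeat that experiment, here is a compilable sketch of the three variants from the question. do_stuff(), stuff_done and run_all() are hypothetical names added so the snippet has something observable to do; paste it into Compiler Explorer with and without optimization to compare the generated code.

```cpp
#include <cassert>

// Hypothetical side effect so that "do stuff" is observable.
static int stuff_done = 0;
inline void do_stuff() { ++stuff_done; }

// Free function that always returns true, visible in this translation unit.
inline bool foo() { return true; }

// Function object that always returns true.
struct Functor {
    bool operator()() const { return true; }
};

// Runs the three conditions from the question and reports how many fired.
int run_all() {
    stuff_done = 0;
    if (true)  { do_stuff(); }   // literal constant
    if (foo()) { do_stuff(); }   // call the optimizer can see through
    Functor f;
    if (f())   { do_stuff(); }   // same, via operator()
    return stuff_done;
}
```

All three branches are taken regardless of optimization level; what changes is only whether calls and comparisons remain in the assembly.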
Modern compilers are incredibly clever, and often do "whole program optimization". So as long as you do sensible things, the compiler will definitely optimise away function calls that just return a constant value. The compiler will also inline code that is only called once [even if it is very large], so writing small functions instead of large ones is definitely worth doing. Of course, if you use the function multiple times, it may not be inlined, but then you get a better cache hit rate from having the same function called from two places, and smaller code overall.
Say I have some functions, each of about two simple lines of code, and they call each other like this: A calls B calls C calls D ... calls K. (So basically it's a long series of short function calls.) How deep will compilers usually go in the call tree to inline these functions?
The question is not meaningful.
If you think about inlining, and its consequences, you'll realise it:
Avoids a function call (with all the register saving/frame adjustment)
Exposes more context to the optimizer (dead stores, dead code, common sub-expression elimination...)
Duplicates code (bloating the instruction cache and the executable size, among other things)
When deciding whether to inline or not, the compiler thus performs a balancing act between the potential bloat created and the speed gain expected. This balancing act is affected by options: for gcc, -O3 means optimize for speed while -Os means optimize for size, and on inlining they have nearly opposite behaviors!
Therefore, what matters is not the "nesting level" but the number of instructions (possibly weighted, as not all are created equal).
This means that a simple forwarding function:
int foo(int a, int b) { return foo(a, b, 3); }
is essentially "transparent" from the inlining point of view.
On the other hand, a function counting a hundred lines of code is unlikely to get inlined. The exception is static free functions called only once, which are almost systematically inlined, as inlining does not create any duplication in this case.
From these two examples we get a hunch of how the heuristics behave:
the fewer instructions the function has, the better for inlining
the less often it is called, the better for inlining
After that, there are parameters you should be able to set to influence the decision one way or another (MSVC has __forceinline, which hints strongly at inlining, gcc has the -finline-limit flag to "raise" the threshold on the instruction count, etc.)
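A rough sketch of how such a hint is usually wrapped so the same source builds on several compilers. SKETCH_FORCE_INLINE is a hypothetical macro name, and the three-argument foo has an invented body; only the two-argument forwarding overload comes from the text above.

```cpp
#include <cassert>

// Hypothetical portability macro: pick the strongest inlining hint available.
#if defined(_MSC_VER)
  #define SKETCH_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
  #define SKETCH_FORCE_INLINE inline   // plain hint on other compilers
#endif

// Invented body for the three-argument overload.
SKETCH_FORCE_INLINE int foo(int a, int b, int c) { return a * b + c; }

// The forwarding overload from the text: tiny, so it is
// "transparent" to the inliner even without any extra hint.
inline int foo(int a, int b) { return foo(a, b, 3); }
```

With this, foo(2, 5) forwards to foo(2, 5, 3). Treat the macro as a request to be verified in the assembly rather than a guarantee.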
On a tangent: do you know about partial inlining ?
It was introduced in gcc in 4.6. The idea, as the name suggests, is to partially inline a function. Mostly, to avoid the overhead of a function call when the function is "guarded" and may (in some cases) return nearly immediately.
For example:
void foo(Bar* x) {
if (not x) { return; } // null pointer, pfff!
// ... BIG BLOC OF STATEMENTS ...
}
void bar(Bar* x) {
// DO 1
foo(x);
// DO 2
}
could get "optimized" as:
void foo#0(Bar* x) {
// ... BIG BLOC OF STATEMENTS ...
}
void bar(Bar* x) {
// DO 1
if (x) { foo#0(x); }
// DO 2
}
Of course, once again the heuristics for inlining apply, but they apply more discriminately!
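If your compiler does not do this for you, the same split can be written by hand. The sketch below is a manual version of the transformation, not what gcc emits: Bar's member, foo_body and its one-line body are hypothetical placeholders for the big block of statements.

```cpp
#include <cassert>

struct Bar { int value; };  // hypothetical payload type

// Out-of-line part: stands in for the big block of statements.
// In a real project this definition could live in a .cpp file.
void foo_body(Bar* x) {
    x->value += 1;
}

// Cheap guard kept inline, so callers only pay for a call when x is non-null.
inline void foo(Bar* x) {
    if (!x) return;
    foo_body(x);
}

void bar(Bar* x) {
    // DO 1
    foo(x);
    // DO 2
}
```

Calling bar(nullptr) returns after a single comparison, while bar(&b) reaches foo_body as before.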
And finally, unless you use WPO (Whole Program Optimization) or LTO (Link Time Optimization), functions can only be inlined if their definition is in the same TU (Translation Unit) as the call site.
I've seen compilers inline more than 5 functions deep. But at some point, it basically becomes a space-efficiency trade-off that the compiler makes. Every compiler is different in this aspect. Visual Studio is very conservative with inlining. GCC (under -O3) and the Intel Compiler love to inline...
Visual Studio includes support for __forceinline. The Microsoft Visual Studio 2005 documentation states:
The __forceinline keyword overrides
the cost/benefit analysis and relies
on the judgment of the programmer
instead.
This raises the question: When is the compiler's cost/benefit analysis wrong? And, how am I supposed to know that it's wrong?
In what scenario is it assumed that I know better than my compiler on this issue?
You know better than the compiler only when your profiling data tells you so.
The one place I am using it is licence verification.
One important factor to protect against easy* cracking is to verify being licenced in multiple places rather than only one, and you don't want these places to be the same function call.
*) Please don't turn this in a discussion that everything can be cracked - I know. Also, this alone does not help much.
The compiler is making its decisions based on static code analysis, whereas if you profile as don says, you are carrying out a dynamic analysis that can be much farther reaching. The number of calls to a specific piece of code is often largely determined by the context in which it is used, e.g. the data. Profiling a typical set of use cases will do this. Personally, I gather this information by enabling profiling on my automated regression tests. In addition to forcing inlines, I have unrolled loops and carried out other manual optimizations on the basis of such data, to good effect. It is also imperative to profile again afterwards, as sometimes your best efforts can actually lead to decreased performance. Again, automation makes this a lot less painful.
More often than not though, in my experience, tweaking algorithms gives much better results than straight code optimization.
I've developed software for limited resource devices for 9 years or so and the only time I've ever seen the need to use __forceinline was in a tight loop where a camera driver needed to copy pixel data from a capture buffer to the device screen. There we could clearly see that the cost of a specific function call really hogged the overlay drawing performance.
The only way to be sure is to measure performance with and without. Unless you are writing highly performance critical code, this will usually be unnecessary.
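A bare-bones sketch of "measure with and without", using only the standard library. Timed, time_it and sum_squares are hypothetical names; a real comparison would repeat each run many times, use realistic inputs, and build both variants with the same flags.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Result of one timed run: the computed value plus elapsed wall-clock seconds.
struct Timed {
    std::uint64_t result;
    double seconds;
};

// Runs `work` once and measures it with a monotonic clock.
// Returning the result keeps the optimizer from discarding the work.
template <class F>
Timed time_it(F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    std::uint64_t r = work();
    auto t1 = std::chrono::steady_clock::now();
    return {r, std::chrono::duration<double>(t1 - t0).count()};
}

// Hypothetical workload: sum of i*i for i in [0, n).
inline std::uint64_t sum_squares(std::uint64_t n) {
    std::uint64_t s = 0;
    for (std::uint64_t i = 0; i < n; ++i) s += i * i;
    return s;
}
```

Time the function once with the inlining attribute and once without, for example via time_it([] { return sum_squares(1000); }), and compare both the seconds and the results, since a change that alters the result is not an optimization.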
For SIMD code.
SIMD code often uses constants/magic numbers. In a regular function, every const __m128 c = _mm_setr_ps(1,2,3,4); becomes a memory reference.
With __forceinline, the compiler can load it once and reuse the value, unless your code exhausts registers (usually 16).
CPU caches are great but registers are still faster.
P.S. Just got 12% performance improvement by __forceinline alone.
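A sketch of the pattern described, under some assumptions: SKETCH_FORCE_INLINE, scale4 and scale_all are hypothetical names, the SSE path is only compiled where the usual predefined macros indicate SSE support, and a scalar fallback keeps the snippet buildable elsewhere. Whether the constant really ends up in a register is something to check in the assembly.

```cpp
#include <cassert>
#include <cstddef>

// Detect SSE via common predefined macros (GCC/Clang: __SSE__, MSVC: _M_X64 / _M_IX86_FP).
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define SKETCH_HAVE_SSE 1
#else
  #define SKETCH_HAVE_SSE 0
#endif

#if defined(_MSC_VER)
  #define SKETCH_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
  #define SKETCH_FORCE_INLINE inline
#endif

#if SKETCH_HAVE_SSE
// Helper with a vector constant. Once inlined into the loop below, the
// compiler is free to hoist the constant and keep it in a register.
SKETCH_FORCE_INLINE __m128 scale4(__m128 v) {
    const __m128 c = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    return _mm_mul_ps(v, c);
}
#endif

// Multiplies each group of four floats by (1, 2, 3, 4).
// n must be a multiple of 4.
void scale_all(float* data, std::size_t n) {
#if SKETCH_HAVE_SSE
    for (std::size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(data + i, scale4(_mm_loadu_ps(data + i)));
#else
    static const float c[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // scalar fallback
    for (std::size_t i = 0; i < n; ++i) data[i] *= c[i % 4];
#endif
}
```

The unaligned load and store intrinsics are used so the caller's buffer needs no special alignment.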
The inline directive will be totally of no use when used for functions which are:
recursive,
long,
composed of loops.
You may want to force the decision using __forceinline, but even with the __forceinline keyword, Visual C++ sometimes chooses not to inline the code. (Source: the resulting assembly code.)
Always look at the resulting assembly code where speed is of importance (such as tight inner loops needed to be run on each frame).
Sometimes using #define instead of inline will do the trick. (Of course you lose a lot of checking by using #define, so use it only when and where it really matters.)
Actually, boost is loaded with it.
For example
BOOST_CONTAINER_FORCEINLINE flat_tree& operator=(BOOST_RV_REF(flat_tree) x)
BOOST_NOEXCEPT_IF( (allocator_traits_type::propagate_on_container_move_assignment::value ||
allocator_traits_type::is_always_equal::value) &&
boost::container::container_detail::is_nothrow_move_assignable<Compare>::value)
{ m_data = boost::move(x.m_data); return *this; }
BOOST_CONTAINER_FORCEINLINE const value_compare &priv_value_comp() const
{ return static_cast<const value_compare &>(this->m_data); }
BOOST_CONTAINER_FORCEINLINE value_compare &priv_value_comp()
{ return static_cast<value_compare &>(this->m_data); }
BOOST_CONTAINER_FORCEINLINE const key_compare &priv_key_comp() const
{ return this->priv_value_comp().get_comp(); }
BOOST_CONTAINER_FORCEINLINE key_compare &priv_key_comp()
{ return this->priv_value_comp().get_comp(); }
public:
// accessors:
BOOST_CONTAINER_FORCEINLINE Compare key_comp() const
{ return this->m_data.get_comp(); }
BOOST_CONTAINER_FORCEINLINE value_compare value_comp() const
{ return this->m_data; }
BOOST_CONTAINER_FORCEINLINE allocator_type get_allocator() const
{ return this->m_data.m_vect.get_allocator(); }
BOOST_CONTAINER_FORCEINLINE const stored_allocator_type &get_stored_allocator() const
{ return this->m_data.m_vect.get_stored_allocator(); }
BOOST_CONTAINER_FORCEINLINE stored_allocator_type &get_stored_allocator()
{ return this->m_data.m_vect.get_stored_allocator(); }
BOOST_CONTAINER_FORCEINLINE iterator begin()
{ return this->m_data.m_vect.begin(); }
BOOST_CONTAINER_FORCEINLINE const_iterator begin() const
{ return this->cbegin(); }
BOOST_CONTAINER_FORCEINLINE const_iterator cbegin() const
{ return this->m_data.m_vect.begin(); }
There are several situations where the compiler is not able to determine categorically whether it is appropriate or beneficial to inline a function. Inlining may involve trade-offs that the compiler is unwilling to make, but you are (e.g., code bloat).
In general, modern compilers are actually pretty good at making this decision.
When you know that the function is going to be called in one place several times for a complicated calculation, then it is a good idea to use __forceinline. For instance, a matrix multiplication for animation may need to be called so many times that the calls to the function will start to be noticed by your profiler. As said by the others, the compiler can't really know about that, especially in a dynamic situation where the execution of the code is unknown at compile time.
A Case For noinline
I wanted to pitch in with an unusual suggestion and actually vouch for __declspec(noinline) in MSVC or the noinline attribute/pragma in GCC and ICC as an alternative to try out first over __forceinline and its equivalents when staring at profiler hotspots. YMMV but I've gotten so much more mileage (measured improvements) out of telling the compiler what to never inline than what to always inline. It also tends to be far less invasive and can produce much more predictable and understandable hotspots when profiling the changes.
While it might seem very counter-intuitive and somewhat backward to try to improve performance by telling the compiler what not to inline, I'd claim based on my experience that it's much more harmonious with how optimizing compilers work and far less invasive to their code generation. A detail to keep in mind that's easy to forget is this:
Inlining a callee can often result in the caller, or the caller of the caller, ceasing to be inlined.
This is what makes force inlining a rather invasive change to the code generation that can have chaotic results on your profiling sessions. I've even had cases where force inlining a function reused in several places completely reshuffled all top ten hotspots with the highest self-samples all over the place in very confusing ways. Sometimes it got to the point where I felt like I'm fighting with the optimizer making one thing faster here only to exchange a slowdown elsewhere in an equally common use case, especially in tricky cases for optimizers like bytecode interpretation. I've found noinline approaches so much easier to use successfully to eradicate a hotspot without exchanging one for another elsewhere.
It would be possible to inline functions much less invasively if we
could inline at the call site instead of determining whether or not
every single call to a function should be inlined. Unfortunately, I've
not found many compilers supporting such a feature besides ICC. It
makes much more sense to me if we are reacting to a hotspot to respond
by inlining at the call site instead of making every single call of a
particular function forcefully inlined. Lacking this wide support
among most compilers, I've gotten far more successful results with
noinline.
Optimizing With noinline
So the idea of optimizing with noinline is still with the same goal in mind: to help the optimizer inline our most critical functions. The difference is that instead of trying to tell the compiler what they are by forcefully inlining them, we are doing the opposite and telling the compiler what functions definitely aren't part of the critical execution path by forcefully preventing them from being inlined. We are focusing on identifying the rare-case non-critical paths while leaving the compiler still free to determine what to inline in the critical paths.
Say you have a loop that executes for a million iterations, and there is a function called baz which is only very rarely called in that loop once every few thousand iterations on average in response to very unusual user inputs even though it only has 5 lines of code and no complex expressions. You've already profiled this code and the profiler shows in the disassembly that calling a function foo which then calls baz has the largest number of samples with lots of samples distributed around calling instructions. The natural temptation might be to force inline foo. I would suggest instead to try marking baz as noinline and time the results. I've managed to make certain critical loops execute 3 times faster this way.
Analyzing the resulting assembly, the speedup came from the foo function now being inlined as a result of no longer inlining baz calls into its body.
I've often found in cases like these that marking the analogous baz with noinline produces even bigger improvements than force inlining foo. I'm not enough of a computer architecture wizard to understand precisely why, but glancing at the disassembly and the distribution of samples in the profiler in such cases, the result of force inlining foo was that the compiler was still inlining the rarely-executed baz on top of foo, making foo more bloated than necessary by still inlining rare-case function calls. By simply marking baz with noinline, we allow foo to be inlined when it wasn't before without actually also inlining baz. Why the extra code resulting from inlining baz as well slowed down the overall function is still not something I understand precisely; in my experience, jump instructions to more distant paths of code always seemed to take more time than closer jumps, but I'm at a loss as to why (maybe something to do with the jump instructions taking more time with larger operands, or something to do with the instruction cache). What I can definitely say for sure is that favoring noinline in such cases offered superior performance to force inlining and also didn't have such disruptive results on the subsequent profiling sessions.
So anyway, I'd suggest to give noinline a try instead and reach for it first before force inlining.
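To make the foo/baz scenario concrete, here is a minimal sketch. SKETCH_NO_INLINE is a hypothetical macro name, and the bodies and the rarity condition are invented for illustration; the speedups described above came from real code and will not necessarily show up in a toy like this.

```cpp
#include <cassert>

// Hypothetical portability macro for "never inline this function".
#if defined(_MSC_VER)
  #define SKETCH_NO_INLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
  #define SKETCH_NO_INLINE __attribute__((noinline))
#else
  #define SKETCH_NO_INLINE
#endif

// Rare path: short, but kept out of line so it does not bloat foo.
SKETCH_NO_INLINE int baz(int v) { return v / 2 + 7; }

// Hot path: left to the compiler's own inlining heuristics.
inline int foo(int v) {
    if (v % 4096 == 0) return baz(v);  // taken once every few thousand calls
    return v + 1;
}

// Critical loop: sums foo(i) for i in [1, iterations].
long run_loop(int iterations) {
    long total = 0;
    for (int i = 1; i <= iterations; ++i) total += foo(i);
    return total;
}
```

Time run_loop with and without the macro on baz and inspect the disassembly to see whether foo is inlined into the loop in each case.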
Human vs. Optimizer
In what scenario is it assumed that I know better than my compiler on
this issue?
I'd refrain from being so bold as to assume. At least I'm not good enough to do that. If anything, I've learned over the years the humbling fact that my assumptions are often wrong once I check and measure things I try with the profiler. I have gotten past the stage (over a couple of decades of making my profiler my best friend) of taking completely blind stabs in the dark only to face humbling defeat and revert my changes, but at my best, I'm still making, at most, educated guesses. Still, I've always known better than my compiler, and hopefully most of us programmers have always known this better than our compilers, how our product is supposed to be designed and how it is most likely going to be used by our customers. That at least gives us some edge in the understanding of common-case and rare-case branches of code that compilers don't possess (at least without PGO, and I've never gotten the best results with PGO). Compilers don't possess this type of runtime information and foresight of common-case user inputs. It is when I combine this user-end knowledge with a profiler in hand that I've found the biggest improvements, nudging the optimizer here and there by teaching it things like what to inline or, more commonly in my case, what to never inline.