Micro optimization - compiler optimization when accessing recursive members - c++

I'm interested in writing good code from the beginning instead of optimizing the code later. Sorry for not providing a benchmark; I don't have a working scenario at the moment. Thanks for your attention!
What are the performance gains of using FunctionY over FunctionX?
There is a lot of discussion on Stack Overflow about this already, but I'm in doubt about the case of accessing sub-members (recursively), as shown below. Will the compiler (say VS2008) optimize FunctionX into something like FunctionY?
void FunctionX(Obj * pObj)
void FunctionY(Obj * pObj)
W * localPtr = pObj->MemberQ->MemberW;

As long as none of the member pointers are volatile or pointers to volatile, and operator -> is not overloaded for any member in the chain, both functions are the same.
The optimization you suggested is widely known as common subexpression elimination and has been supported by the vast majority of compilers for decades.
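To make the comparison concrete, here is a minimal sketch of the two shapes being discussed. Only Obj, MemberQ, MemberW, W and localPtr come from the question; the types Q and A, the members MemberA and MemberB, and the bodies of function1 and function2 are placeholders invented for illustration.

```cpp
// Hypothetical types: only Obj, MemberQ, MemberW, W and localPtr appear in the question.
struct A {
    int calls = 0;
    void function1() { ++calls; }
    void function2() { calls += 2; }
};
struct W { A MemberA; A MemberB; };
struct Q { W* MemberW; };
struct Obj { Q* MemberQ; };

// Repeats the full pointer chain on every access.
void FunctionX(Obj* pObj) {
    pObj->MemberQ->MemberW->MemberA.function1();
    pObj->MemberQ->MemberW->MemberA.function2();
    pObj->MemberQ->MemberW->MemberB.function1();
}

// Walks the chain once and reuses the result.
void FunctionY(Obj* pObj) {
    W* localPtr = pObj->MemberQ->MemberW;
    localPtr->MemberA.function1();
    localPtr->MemberA.function2();
    localPtr->MemberB.function1();
}
```

In this sketch the member functions are visible to the compiler, so it can see that they never change MemberQ or MemberW. If the calls went to functions defined in another translation unit, the compiler might have to reload the chain after each call, so comparing the generated assembly for your real code is the only reliable check.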

In theory, you save on the extra pointer dereferences, HOWEVER, in the real world, the compiler will probably optimize it out for you, so it's a useless optimization.
This is why it's important to profile first, and then optimize later. The compiler is doing everything it can to help you, you might as well make sure you're not just doing something it's already doing.

If the compiler is good enough, it should translate FunctionX into something similar to FunctionY.
But you can get different results with different compilers, and with the same compiler under different optimization flags.
With a "dumb" compiler FunctionY should be faster, and IMHO it is more readable and faster to write. So stick with FunctionY.
P.S. You should take a look at a code style guide; normally member and function names should start with a lowercase letter.


Is it a good idea to write multi-level inline functions in C++?

Is something like this code considered a bad practice?
If so, what should I do when func1 duplicates func2's behavior and I need both functions to be present (isn't that considered code redundancy)?!
UPD: Sorry for my bad illustration, I'll try to explain the question more clearly.
What I wanted to ask is this:
I'm trying to design an optimized class that heavily calls two methods, func1 and func2. func1's implementation uses func2, and I want both calls to be inlined as much as possible. Is it better to call func2 from func1 as in this code, or to implement both independently?
inline int func2(int x) {
    return x * (x + 2);
}
inline int func1(int x) {
    return x * (x + 1) * func2(x + 2);
}
Writing several small functions is fine if it avoids writing the same code more than once. Some may argue that too many small functions makes code hard to read and that's a matter of opinion.
If you are worried about performance, the compiler will inline if it thinks it will help, you shouldn't worry about it until you've proven that there is a problem. See this question on premature optimization.
There's no problem in a function calling another function. You'll see real programs go much deeper than 2 calls, if you sample them.
As far as inlining, that's also no problem. An optimizing compiler would typically inline func2 (assuming its definition is visible and optimizations are enabled). Many common compilers and optimizers are smart about inlining. They often know when to inline and when not to inline -- all without your assistance.
Writing small functions is not a bad practice. Clarity and intent are typically of a higher importance than micro-optimizations. Under typical circumstances, there's nothing wrong with your example.
If it helps the readability of your code, then yes. You should almost always aim for readability. Don't forget to name your functions properly so other people can easily understand what each function does. And by other people I mean you too, in a few weeks or months. As they say, you write code once but read it many times.
As for performance, modern compilers know when to inline a function, and you should not worry about it. In the cases where it really matters, you use a profiler to find the hotspot and then change it. But that will happen far less often than you think, and you will almost always find better ways to optimize your code.
If both are implemented in the same scope, the compiler can even do some algebraic optimization without inline. Some time ago I was very surprised to see the compiler sometimes replace big, complex structures with simple calls to the destination functions (a kind of argument forwarding for the D3D API). So if you worry about performance, just don't, unless your app's benchmarks are really bad.
On the other hand, it's all about relationships: if func1 is not logically related to func2 and only the code or math happens to be the same, it is better to copy func2's body into func1. Why? Because func2 may change later, and if you forget about func1 you will break it, since the two were never related by domain logic.
UPD after UPD
If it is all about speed and there is only math, then write a fully optimized expression in func1 and don't rely on the compiler. But do that only if you really know that performance comes first.
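As a rough illustration of the two options, the sketch below puts the composed func1 from the question next to a hand-expanded variant. The name func1_expanded is made up for this comparison, and whether the expanded form is any faster is something only a measurement on your compiler can tell.

```cpp
inline int func2(int x) {
    return x * (x + 2);
}

// Composed version: reuses func2, as in the question.
inline int func1(int x) {
    return x * (x + 1) * func2(x + 2);
}

// Hand-expanded version: func2(x + 2) is (x + 2) * (x + 4).
// func1_expanded is a hypothetical name used only for this comparison.
inline int func1_expanded(int x) {
    return x * (x + 1) * (x + 2) * (x + 4);
}
```

Both versions compute the same value, so with optimizations enabled an inlining compiler will often produce the same machine code for them. The composed form has the advantage that a fix to func2 reaches func1 automatically.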

Can I selectively (force) inline a function?

In the book Clean Code (and a couple of others I have come across and read) it is suggested to keep the functions small and break them up if they become large. It also suggests that functions should do one thing and one thing only.
In Optimizing software in C++ Agner Fog states that he does not like the rule of breaking up a function just because it crosses a certain threshold of a number of lines. He states that this results in unnecessary jumps which degrade performance.
First off, I understand that it will not matter if the code I am working on is not in a tight loop and that the functions are heavy so that the time it takes to call them is dwarfed by the time the code in the function takes to execute. But let's assume that I am working with functions that are, most of the time, used by other objects/functions and are performing relatively trivial tasks. These functions follow the suggestions listed in the first paragraph (that is, perform one single function and are small/comprehensible). Then I start programming a performance critical function that utilizes these other functions in a tight loop and is essentially a frame function. Lastly, assume that in-lining them has a benefit for the performance critical function but no benefit whatsoever to any other function (yes, I have profiled this, albeit with a lot of copying and pasting which I want to avoid).
Immediately, one can say: tag the function inline and let the compiler choose. But what if I don't want all those functions to be in a .inl file or exposed in the header? In my current situation, the performance critical functions and the other functions it uses are all in the same source file.
To sum it up, can I selectively (force) inline a function(s) for a single function so that the end code behaves like it is one big function instead of several calls to other functions?
There is nothing that prevents you from putting inline on a static function in a .cpp file.
Some compilers have an option to force inlining, e.g. GCC's __attribute__((always_inline)), plus a ton of options to fine-tune the inlining optimizations (see the -finline-* options).
My recommendation is to use inline or even better static inline wherever you see fit, and let the compiler decide. They usually do it pretty well.
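A minimal sketch of that layout, assuming everything lives in a single source file. The file name and the functions scale, clamp01 and processFrame are all invented for illustration.

```cpp
// hot_path.cpp (hypothetical file): the helpers stay private to this translation unit.
#include <cstddef>

// static gives internal linkage, so no header or .inl file is needed.
static inline float scale(float v, float factor) {
    return v * factor;
}

static inline float clamp01(float v) {
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

// The performance-critical function: both helper bodies are visible here,
// so the compiler is free to inline them into the loop.
void processFrame(float* samples, std::size_t count, float gain) {
    for (std::size_t i = 0; i < count; ++i) {
        samples[i] = clamp01(scale(samples[i], gain));
    }
}
```

Because the helpers are not visible outside this file, the compiler also knows every call site, which tends to make its inlining decision easier.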
You cannot force the inline. Also, function calls are pretty cheap on modern CPUs, compared to the cost of the work done. If your functions are large enough to need to be broken down, the additional time taken to do the call will be essentially nothing.
Failing that, you could ... try ... to use a macro.
No, inline is a recommendation to the compiler; it does not force it to do anything. Also, if you're working with MSVC++, note that __forceinline is a misnomer as well; it's just a stronger recommendation than inline.
This is as much about good old fashioned straight C as it is about C++. I was pondering this the other day, because in an embedded world, where both speed and space need to be carefully managed, this can really matter (as opposed to the all too oft "don't worry about it, your compiler is smart and memory is cheap" prevalent in desktop/server development).
A possible solution that I have yet to vet is to basically use two names for the different variants, something like
inline int _max(int a, int b) {
    return a > b ? a : b;
}
and then
int max(int a, int b) {
    return _max(a, b);
}
This would give one the ability to selectively call either _max() or max() while still having the algorithm defined once-and-only-once.
Inlining – For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.
VS Profile-Guided Optimizations
You can use the automated Profile Guided Optimization for Visual C++ plug-in in the Performance and Diagnostics Hub to simplify and streamline the optimization process within Visual Studio, or you can perform the optimization steps manually in Visual Studio or on the command line. We recommend the plug-in because it is easier to use. For information on how to get the plug-in and use it to optimize your app, see Profile Guided Optimization Plug-In.
If you have a known-hot function and want the compiler to inline more aggressively than usual, the flatten attribute offered by gcc/clang might be something to look into. In contrast to the inline keyword and attributes, it applies to inlining decisions regarding the functions called in the marked function.
__attribute__((flatten)) void hot_code() {
    // functions called here will be inlined if possible
}
See https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html and https://clang.llvm.org/docs/AttributeReference.html#flatten for official documentation.
Compilers are actually really really good at generating optimized code.
I would suggest just organizing your code into logical groupings (using additional functions if that enhances readability), marking them inline if appropriate, and letting the compiler decide what code to optimally generate.
Quite surprised this hasn't been mentioned yet, but you can tell the compiler (I believe it may only work with GCC/G++) to force inline a function and ignore a couple of restrictions associated with it.
You can do so via __attribute__((always_inline)).
Example of it in use:
inline __attribute__((always_inline)) int pleaseInlineThis() {
    return 5;
}
Normally you should avoid forcing an inline, as the compiler usually knows better than you; however, there are use cases, such as OS or microcontroller development, where a call must be inlined because an actual call would break the functionality.
C++ compilers usually aren't very friendly to controlled environments such as those without some hacks.
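Since the force-inline spelling differs between compilers, one common approach is to hide it behind a macro. The sketch below is one way to do that; FORCE_INLINE and caller are hypothetical names, and the fallback branch is only an ordinary inline hint.

```cpp
// FORCE_INLINE is a hypothetical macro name; the keywords it wraps are compiler-specific.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline  // fallback: ordinary hint only
#endif

FORCE_INLINE int pleaseInlineThis() {
    return 5;
}

// The call below is the one the attribute asks the compiler to expand in place.
int caller() {
    return pleaseInlineThis() + 1;
}
```

Even with such a macro, checking the generated assembly remains the only way to confirm that the call was actually expanded.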
As people mentioned, you should avoid doing that as the compiler usually makes better decisions. There are several optimizations that you can enable to improve performance. These will inline the functions if needed:
LTO: link-time optimization or interprocedural optimization
Profile guided optimization: optimizations based on a runtime profile
BOLT: Binary Optimization and Layout Tool
Polly: a high-level loop and data-locality optimizer

Hints for the compiler to help it with the optimization task

The const and volatile chapter on the 'Surviving the Release Version' Article gave me the idea that the compiler can use the const keyword as hint for its optimization job.
Do you know some other optimization-hints for the compiler or design principles for functions so that the compiler can make them inline?
By the way, do you declare primitive-type function parameters as const or const reference (like void foo(const int i) or void foo(const int& i))?
It is rare that const qualification can help the compiler to optimize your code. You can read more about why this is the case in Herb Sutter's "Constant Optimization?"
Concerning your last question: in general, you should prefer to pass by value things that are cheap to copy (like fundamental type objects--ints and floats and such--and small class type objects) and pass other types by const reference. This is a very general rule and there are lots of caveats and exceptions.
As soon as you enable some optimization the compiler will notice that the parameter i is never modified, so whether you declare it as int or as const int doesn't matter for the generated code.
The point of passing parameters by const & is to avoid needless copying. In case of small parameters (one machine word or less) this doesn't lead to better performance, so you shouldn't do that. foo(int) is more efficient than foo(const int&).
There's no practical benefit to either form. If the type is less than a single machine word, take it by value. The other thing is that a modern compiler's semantic analysis is way above what const can and can't do, you could only apply optimizations if it was pre-compiled or your code was VERY complex. The article you linked to is several years old and the compiler has done nothing but improve massively since then.
I don't think the compiler can use the const keyword for optimization, since constness can be cast away at any point.
It is more for correctness than optimization.
A few "general compiler" things off the top of my head.
const as a hint that a variable will never change
volatile as a hint that a variable can change at any point
restrict keyword
memory barriers (to hint to the compiler a specific ordering) - probably not so much an "optimisation" mind you.
inline keyword (use very carefully)
All that however should only come from an extensive profiling routine so you know what actually needs to be optimised. Compilers in general are pretty good at optimising without much in the way of hints from the programmer.
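Of the hints listed above, restrict is probably the least familiar in C++, where it exists only as a compiler extension. A minimal sketch, assuming a compiler that accepts the __restrict spelling (GCC, Clang and MSVC do); RESTRICT and addArrays are names invented for this example.

```cpp
#include <cstddef>

// RESTRICT is a hypothetical macro. __restrict is a compiler extension in C++,
// not part of the C++ standard.
#if defined(_MSC_VER) || defined(__GNUC__) || defined(__clang__)
#  define RESTRICT __restrict
#else
#  define RESTRICT
#endif

// Promises the compiler that dst and src never overlap, which can let it
// vectorize the loop without a runtime aliasing check.
void addArrays(float* RESTRICT dst, const float* RESTRICT src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] += src[i];
    }
}
```

The promise is the programmer's responsibility: calling addArrays with overlapping buffers is undefined behavior, so this hint only belongs where the profiler says it pays off.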
If you look into the sources of the Linux kernel or similar projects, you will find all the optimisation clues that are passed on to gcc (or whichever compiler is used). The Linux kernel uses every feature that gcc offers, even if it is not in the standard.
This page sums up gcc's extensions to the C language. I refer to C here because const and volatile are used in C as well. The focus of the question appears to be compiler optimization more than C or C++.
I don't think the real purpose of const has much to do with optimization, though it helps.
Isn't the real value in compile-time checking, to prevent you from modifying things you shouldn't modify, i.e. preventing bugs?
For small arguments that you are not going to modify, use call-by-value.
For large arguments that you are not going to modify, use either call-by-reference or passing the address (which are basically the same thing), along with const.
For large or small arguments that you are going to modify, drop the const.
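The three rules above can be sketched as follows. The function names are invented for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Small, cheap-to-copy argument that is not modified: pass by value.
int square(int i) {
    return i * i;
}

// Large argument that is not modified: pass by const reference to avoid a copy.
std::size_t totalLength(const std::vector<std::string>& words) {
    std::size_t total = 0;
    for (const std::string& w : words) {
        total += w.size();
    }
    return total;
}

// Argument that is modified: drop the const.
void appendExclamation(std::string& s) {
    s += '!';
}
```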
BTW: In case it's news, for real performance, you need to know how to find the problems you actually have, by profiling. No compiler can do that for you.

c++ optimization

I'm working on some existing c++ code that appears to be written poorly, and is very frequently called. I'm wondering if I should spend time changing it, or if the compiler is already optimizing the problem away.
I'm using Visual Studio 2008.
Here is an example:
void someDrawingFunction(....)
Here is how I would do it:
void someDrawingFunction(....)
MyContext &c = GetContext();
Don't guess at where your program is spending time. Profile first to find your bottlenecks, then optimize those.
As for GetContext(), that depends on how complex it is. If it's just returning a class member variable, then chances are that the compiler will inline it. If GetContext() has to perform a more complicated operation (such as looking up the context in a table), the compiler probably isn't inlining it, and you may wish to only call it once, as in your second snippet.
If you're using GCC, you can also tag the GetContext() function with the pure attribute. This will allow it to perform more optimizations, such as common subexpression elimination.
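A minimal sketch of that attribute, assuming GetContext() really has no side effects. PURE_FN, the fields of MyContext, g_context and area are made up for this example; only MyContext and GetContext come from the question.

```cpp
// PURE_FN is a hypothetical macro; __attribute__((pure)) is a GCC/Clang extension.
#if defined(__GNUC__) || defined(__clang__)
#  define PURE_FN __attribute__((pure))
#else
#  define PURE_FN
#endif

struct MyContext { int width; int height; };

static MyContext g_context = {640, 480};  // stand-in for the real context

// pure: no side effects; the result depends only on arguments and global memory.
PURE_FN const MyContext& GetContext();

const MyContext& GetContext() {
    return g_context;
}

int area() {
    // With the attribute, the compiler is allowed to merge these two calls.
    return GetContext().width * GetContext().height;
}
```

The attribute is a promise rather than something the compiler verifies, so marking a function that does have side effects can silently change program behavior.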
If you're sure it's a performance problem, change it. If GetContext is a function call (as opposed to a macro or an inline function), then the compiler is going to HAVE to call it every time, because the compiler can't necessarily see what it's doing, and thus, the compiler probably won't know that it can eliminate the call.
Of course, you'll need to make sure that GetContext ALWAYS returns the same thing, and that this 'optimization' is safe.
If it is logically correct to do it the second way, i.e. calling GetContext() once or multiple times does not affect your program logic, I'd do it the second way even if you profile it and prove that there is no performance difference either way, so the next developer looking at this code will not ask the same question again.
Obviously, if GetContext() has side effects (I/O, updating globals, etc.) then the suggested optimization will produce different results.
So unless the compiler can somehow detect that GetContext() is pure, you should optimize it yourself.
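The sketch below shows why side effects matter here. It assumes the first snippet called GetContext() once per drawing operation, which is a guess since the bodies are not shown. The counter g_lookups, the field drawCalls and the two draw functions are invented for illustration.

```cpp
struct MyContext { int drawCalls = 0; };

static MyContext g_context;   // hypothetical stand-in for the real context
static int g_lookups = 0;     // counts how often GetContext() runs

// Has a visible side effect (the counter), so the compiler must keep every call.
MyContext& GetContext() {
    ++g_lookups;
    return g_context;
}

// Assumed shape of the first snippet: one lookup per operation.
void drawRepeatedLookup() {
    GetContext().drawCalls++;
    GetContext().drawCalls++;
    GetContext().drawCalls++;
}

// Shape of the second snippet: one lookup, then reuse the reference.
void drawCachedLookup() {
    MyContext& c = GetContext();
    c.drawCalls++;
    c.drawCalls++;
    c.drawCalls++;
}
```

Both functions perform the same three updates, but the number of lookups differs, which is exactly the kind of observable change that stops the compiler from doing this rewrite on its own.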
If you're wondering what the compiler does, look at the assembly code.
That is such a simple change, I would do it.
It is quicker to fix it than to debate it.
But do you actually have a problem?
Just because it's called often doesn't mean it's called TOO often.
If it seems qualitatively piggy, sample it to see what it's spending time at.
Chances are excellent that it is not what you would have guessed.

When should I use __forceinline instead of inline?

Visual Studio includes support for __forceinline. The Microsoft Visual Studio 2005 documentation states:
The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer
This raises the question: When is the compiler's cost/benefit analysis wrong? And, how am I supposed to know that it's wrong?
In what scenario is it assumed that I know better than my compiler on this issue?
You know better than the compiler only when your profiling data tells you so.
The one place I am using it is licence verification.
One important factor to protect against easy* cracking is to verify being licenced in multiple places rather than only one, and you don't want these places to be the same function call.
*) Please don't turn this in a discussion that everything can be cracked - I know. Also, this alone does not help much.
The compiler is making its decisions based on static code analysis, whereas if you profile as don says, you are carrying out a dynamic analysis that can be much farther reaching. The number of calls to a specific piece of code is often largely determined by the context in which it is used, e.g. the data. Profiling a typical set of use cases will do this. Personally, I gather this information by enabling profiling on my automated regression tests. In addition to forcing inlines, I have unrolled loops and carried out other manual optimizations on the basis of such data, to good effect. It is also imperative to profile again afterwards, as sometimes your best efforts can actually lead to decreased performance. Again, automation makes this a lot less painful.
More often than not though, in my experience, tweaking algorithms gives much better results than straight code optimization.
I've developed software for limited resource devices for 9 years or so and the only time I've ever seen the need to use __forceinline was in a tight loop where a camera driver needed to copy pixel data from a capture buffer to the device screen. There we could clearly see that the cost of a specific function call really hogged the overlay drawing performance.
The only way to be sure is to measure performance with and without. Unless you are writing highly performance critical code, this will usually be unnecessary.
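A rough sketch of such a measurement using std::chrono. timeMicros and sumSquares are hypothetical names, and a serious benchmark would repeat the run many times and make sure the optimizer cannot discard the work.

```cpp
#include <chrono>
#include <cstdint>

// Times one run of a callable and returns the elapsed microseconds.
template <typename F>
std::int64_t timeMicros(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Hypothetical workload: build it once with and once without the forced inline,
// then compare the timings.
std::uint64_t sumSquares(std::uint32_t n) {
    std::uint64_t total = 0;
    for (std::uint32_t i = 0; i < n; ++i) {
        total += static_cast<std::uint64_t>(i) * i;
    }
    return total;
}
```

A call such as timeMicros([&] { result = sumSquares(1000000); }) gives one sample; storing the result in a variable that is used afterwards keeps the loop from being optimized away.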
For SIMD code.
SIMD code often uses constants/magic numbers. In a regular function, every const __m128 c = _mm_setr_ps(1,2,3,4); becomes a memory reference.
With __forceinline, the compiler can load it once and reuse the value, unless your code exhausts registers (usually 16).
CPU caches are great but registers are still faster.
P.S. Just got 12% performance improvement by __forceinline alone.
The inline directive will be totally of no use when used for functions which are:
composed of loops,
If you want to force this decision using __forceinline
Actually, even with the __forceinline keyword, Visual C++ sometimes chooses not to inline the code. (Source: the resulting assembly code.)
Always look at the resulting assembly code where speed is of importance (such as tight inner loops needed to be run on each frame).
Sometimes using #define instead of inline will do the trick (of course you lose a lot of checking by using #define, so use it only when and where it really matters).
Actually, boost is loaded with it.
For example
BOOST_CONTAINER_FORCEINLINE flat_tree& operator=(BOOST_RV_REF(flat_tree) x)
BOOST_NOEXCEPT_IF( (allocator_traits_type::propagate_on_container_move_assignment::value ||
allocator_traits_type::is_always_equal::value) &&
{ m_data = boost::move(x.m_data); return *this; }
BOOST_CONTAINER_FORCEINLINE const value_compare &priv_value_comp() const
{ return static_cast<const value_compare &>(this->m_data); }
BOOST_CONTAINER_FORCEINLINE value_compare &priv_value_comp()
{ return static_cast<value_compare &>(this->m_data); }
BOOST_CONTAINER_FORCEINLINE const key_compare &priv_key_comp() const
{ return this->priv_value_comp().get_comp(); }
BOOST_CONTAINER_FORCEINLINE key_compare &priv_key_comp()
{ return this->priv_value_comp().get_comp(); }
// accessors:
BOOST_CONTAINER_FORCEINLINE Compare key_comp() const
{ return this->m_data.get_comp(); }
BOOST_CONTAINER_FORCEINLINE value_compare value_comp() const
{ return this->m_data; }
BOOST_CONTAINER_FORCEINLINE allocator_type get_allocator() const
{ return this->m_data.m_vect.get_allocator(); }
BOOST_CONTAINER_FORCEINLINE const stored_allocator_type &get_stored_allocator() const
{ return this->m_data.m_vect.get_stored_allocator(); }
BOOST_CONTAINER_FORCEINLINE stored_allocator_type &get_stored_allocator()
{ return this->m_data.m_vect.get_stored_allocator(); }
{ return this->m_data.m_vect.begin(); }
BOOST_CONTAINER_FORCEINLINE const_iterator begin() const
{ return this->cbegin(); }
BOOST_CONTAINER_FORCEINLINE const_iterator cbegin() const
{ return this->m_data.m_vect.begin(); }
There are several situations where the compiler is not able to determine categorically whether it is appropriate or beneficial to inline a function. Inlining may involve trade-offs that the compiler is unwilling to make, but you are (e.g., code bloat).
In general, modern compilers are actually pretty good at making this decision.
When you know that the function is going to be called in one place several times for a complicated calculation, then it is a good idea to use __forceinline. For instance, a matrix multiplication for animation may need to be called so many times that the calls to the function will start to be noticed by your profiler. As said by the others, the compiler can't really know about that, especially in a dynamic situation where the execution of the code is unknown at compile time.
A Case For noinline
I wanted to pitch in with an unusual suggestion and actually vouch for __declspec(noinline) in MSVC or the noinline attribute/pragma in GCC and ICC as an alternative to try out first over __forceinline and its equivalents when staring at profiler hotspots. YMMV but I've gotten so much more mileage (measured improvements) out of telling the compiler what to never inline than what to always inline. It also tends to be far less invasive and can produce much more predictable and understandable hotspots when profiling the changes.
While it might seem very counter-intuitive and somewhat backward to try to improve performance by telling the compiler what not to inline, I'd claim based on my experience that it's much more harmonious with how optimizing compilers work and far less invasive to their code generation. A detail to keep in mind that's easy to forget is this:
Inlining a callee can often result in the caller, or the caller of the caller, ceasing to be inlined.
This is what makes force inlining a rather invasive change to the code generation that can have chaotic results on your profiling sessions. I've even had cases where force inlining a function reused in several places completely reshuffled all top ten hotspots with the highest self-samples all over the place in very confusing ways. Sometimes it got to the point where I felt like I'm fighting with the optimizer making one thing faster here only to exchange a slowdown elsewhere in an equally common use case, especially in tricky cases for optimizers like bytecode interpretation. I've found noinline approaches so much easier to use successfully to eradicate a hotspot without exchanging one for another elsewhere.
It would be possible to inline functions much less invasively if we could inline at the call site instead of determining whether or not every single call to a function should be inlined. Unfortunately, I've not found many compilers supporting such a feature besides ICC. It makes much more sense to me if we are reacting to a hotspot to respond by inlining at the call site instead of making every single call of a particular function forcefully inlined. Lacking this wide support among most compilers, I've gotten far more successful results with noinline.
Optimizing With noinline
So the idea of optimizing with noinline is still with the same goal in mind: to help the optimizer inline our most critical functions. The difference is that instead of trying to tell the compiler what they are by forcefully inlining them, we are doing the opposite and telling the compiler what functions definitely aren't part of the critical execution path by forcefully preventing them from being inlined. We are focusing on identifying the rare-case non-critical paths while leaving the compiler still free to determine what to inline in the critical paths.
Say you have a loop that executes for a million iterations, and there is a function called baz which is only very rarely called in that loop once every few thousand iterations on average in response to very unusual user inputs even though it only has 5 lines of code and no complex expressions. You've already profiled this code and the profiler shows in the disassembly that calling a function foo which then calls baz has the largest number of samples with lots of samples distributed around calling instructions. The natural temptation might be to force inline foo. I would suggest instead to try marking baz as noinline and time the results. I've managed to make certain critical loops execute 3 times faster this way.
Analyzing the resulting assembly, the speedup came from the foo function now being inlined as a result of no longer inlining baz calls into its body.
I've often found in cases like these that marking the analogous baz with noinline produces even bigger improvements than force inlining foo. I'm not a computer architecture wizard to understand precisely why but glancing at the disassembly and the distribution of samples in the profiler in such cases, the result of force inlining foo was that the compiler was still inlining the rarely-executed baz on top of foo, making foo more bloated than necessary by still inlining rare-case function calls. By simply marking baz with noinline, we allow foo to be inlined when it wasn't before without actually also inlining baz. Why the extra code resulting from inlining baz as well slowed down the overall function is still not something I understand precisely; in my experience, jump instructions to more distant paths of code always seemed to take more time than closer jumps, but I'm at a loss as to why (maybe something to do with the jump instructions taking more time with larger operands or something to do with the instruction cache). What I can definitely say for sure is that favoring noinline in such cases offered superior performance to force inlining and also didn't have such disruptive results on the subsequent profiling sessions.
So anyway, I'd suggest to give noinline a try instead and reach for it first before force inlining.
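A minimal sketch of the foo/baz arrangement described above. NOINLINE is a hypothetical macro wrapping the compiler-specific spellings, and the bodies of foo, baz and hotLoop are invented to give the example something to compute.

```cpp
// NOINLINE is a hypothetical macro wrapping the compiler-specific spellings.
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#  define NOINLINE __attribute__((noinline))
#else
#  define NOINLINE
#endif

// Rare path: short, but kept out of line so it does not bloat its callers.
NOINLINE int baz(int value) {
    return value < 0 ? -value : value;
}

// Common path: left to the optimizer, which no longer has to absorb baz.
inline int foo(int value) {
    if (value % 4096 == 0) {   // rarely true in this sketch
        return baz(value - 1);
    }
    return value + 1;
}

// The hot loop that calls foo on every iteration.
long long hotLoop(int iterations) {
    long long total = 0;
    for (int i = 1; i <= iterations; ++i) {
        total += foo(i);
    }
    return total;
}
```

Whether this helps depends entirely on the surrounding code, so time the loop before and after adding the attribute and inspect the disassembly to see whether foo was inlined.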
Human vs. Optimizer
In what scenario is it assumed that I know better than my compiler on this issue?
I'd refrain from being so bold as to assume. At least I'm not good enough to do that. If anything, I've learned over the years the humbling fact that my assumptions are often wrong once I check and measure things I try with the profiler. I have gotten past the stage (over a couple of decades of making my profiler my best friend) to avoid completely blind stabs in the dark only to face humbling defeat and revert my changes, but at my best, I'm still making, at most, educated guesses. Still, I've always known better than my compiler, and hopefully, most of us programmers have always known this better than our compilers, how our product is supposed to be designed and how it is most likely going to be used by our customers. That at least gives us some edge in the understanding of common-case and rare-case branches of code that compilers don't possess (at least without PGO and I've never gotten the best results with PGO). Compilers don't possess this type of runtime information and foresight of common-case user inputs. It is when I combine this user-end knowledge, and with a profiler in hand, that I've found the biggest improvements nudging the optimizer here and there in teaching it things like what to inline or, more commonly in my case, what to never inline.