c++ optimization - c++

I'm working on some existing c++ code that appears to be written poorly, and is very frequently called. I'm wondering if I should spend time changing it, or if the compiler is already optimizing the problem away.
I'm using Visual Studio 2008.
Here is an example:
void someDrawingFunction(....)
{
GetContext().DrawSomething(...);
GetContext().DrawSomething(...);
GetContext().DrawSomething(...);
.
.
.
}
Here is how I would do it:
void someDrawingFunction(....)
{
MyContext &c = GetContext();
c.DrawSomething(...);
c.DrawSomething(...);
c.DrawSomething(...);
.
.
.
}

Don't guess at where your program is spending time. Profile first to find your bottlenecks, then optimize those.
As for GetContext(), that depends on how complex it is. If it's just returning a class member variable, then chances are that the compiler will inline it. If GetContext() has to perform a more complicated operation (such as looking up the context in a table), the compiler probably isn't inlining it, and you may wish to only call it once, as in your second snippet.
If you're using GCC, you can also tag the GetContext() function with the pure attribute. This will allow it to perform more optimizations, such as common subexpression elimination.
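To make the trade-off concrete, here is a minimal sketch, assuming GetContext() does a table lookup rather than returning a member. MyContext, the table, and the g_lookups counter are hypothetical stand-ins, not the asker's real code. Because the lookup has an observable effect (the counter), the repeated form performs it three times and the hoisted form once.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the question's context type.
struct MyContext {
    int drawCalls = 0;
    void DrawSomething(int) { ++drawCalls; }
};

static int g_lookups = 0;  // counts how often the lookup actually runs

// Assumed "complicated" GetContext(): looks the context up in a table.
MyContext& GetContext() {
    static std::map<std::string, MyContext> table;
    ++g_lookups;
    return table["main"];
}

// First snippet from the question: one lookup per draw call.
void drawRepeatedLookup() {
    GetContext().DrawSomething(1);
    GetContext().DrawSomething(2);
    GetContext().DrawSomething(3);
}

// Second snippet from the question: one lookup, reused.
void drawHoistedLookup() {
    MyContext& c = GetContext();
    c.DrawSomething(1);
    c.DrawSomething(2);
    c.DrawSomething(3);
}
```

Whether the difference matters in practice is still something only a profiler can tell you.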

If you're sure it's a performance problem, change it. If GetContext is a function call (as opposed to a macro or an inline function), then the compiler is going to HAVE to call it every time, because the compiler can't necessarily see what it's doing, and thus, the compiler probably won't know that it can eliminate the call.
Of course, you'll need to make sure that GetContext ALWAYS returns the same thing, and that this 'optimization' is safe.

If it is logically correct to do it the second way, i.e. calling GetContext() once or multiple times does not affect your program logic, I'd do it the second way even if you profile it and prove that there is no performance difference either way, so the next developer looking at this code will not ask the same question again.

Obviously, if GetContext() has side effects (I/O, updating globals, etc.) then the suggested optimization will produce different results.
So unless the compiler can somehow detect that GetContext() is pure, you should optimize it yourself.
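As an illustration of how a side effect changes the outcome, here is a small sketch in which a made-up GetContext() alternates between two buffers on every call. Ctx, g_buffers, and the alternating behaviour are assumptions for the example only. With this definition the two ways of writing the drawing function are not equivalent.

```cpp
#include <array>

// Hypothetical context type that just counts draw calls.
struct Ctx {
    int draws = 0;
    void DrawSomething() { ++draws; }
};

static std::array<Ctx, 2> g_buffers;
static int g_next = 0;

// Side effect: every call advances to the other buffer.
Ctx& GetContext() {
    Ctx& c = g_buffers[g_next];
    g_next = (g_next + 1) % 2;
    return c;
}

void resetBuffers() {
    g_buffers.fill(Ctx{});
    g_next = 0;
}

// Each draw goes to a different buffer.
void drawRepeated() {
    GetContext().DrawSomething();
    GetContext().DrawSomething();
}

// Both draws go to the same buffer: different observable behaviour.
void drawHoisted() {
    Ctx& c = GetContext();
    c.DrawSomething();
    c.DrawSomething();
}
```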

If you're wondering what the compiler does, look at the assembly code.

That is such a simple change, I would do it.
It is quicker to fix it than to debate it.
But do you actually have a problem?
Just because it's called often doesn't mean it's called TOO often.
If it seems qualitatively piggy, sample it to see what it's spending time at.
Chances are excellent that it is not what you would have guessed.

Related

If a function is only called from one place, is it always better to inline it? [duplicate]

If a function is only used in one place and some profiling shows that it's not being inlined, will there always be a performance advantage in forcing the compiler to inline it?
Obviously "profile and see" (and in the case of the function in question, it did prove to be a small perf boost). I'm mostly asking out of curiosity -- are there any performance disadvantages to this with a reasonably smart compiler?
No, there are notable exceptions. Take this code for example:
void do_something_often(void) {
x++;
if (x == 100000000) {
do_a_lot_of_work();
}
}
Let's say do_something_often() is called very often and from many places. do_a_lot_of_work() is called very rarely (one out of every one hundred million calls). Inlining do_a_lot_of_work() into do_something_often() doesn't gain you anything. Since do_something_often() does almost nothing, it would be much better if it got inlined into the functions that call it, and in the rare case that they need to call do_a_lot_of_work(), they call it out of line. In that way, they are saving a function call almost every time, and saving code bloat at every call site.
One legitimate case where it makes sense not to inline a function, even if it's only called from a single location, is if the call to the function is rare and almost always skipped. Keeping the instructions before the function call and the instructions after the function call closely together in memory may allow those instructions to be kept in the processor cache, when that would be impossible if those blocks of instructions were separated in memory.
It would still be possible for the compiler to compile the function call as if using goto, avoiding having to keep track of a return address, but if the compiler has already determined that the function call is rare, then it makes sense to not spend as much time optimising that call.
You can't "force" the compiler to inline it, unless you are considering some implementation-specific tools that you have not mentioned, so the question is entirely moot.
If your compiler is already not doing so then it has a reason.
If the function is called only once, there should be no performance disadvantages in inlining it. However, that does not mean you should blindly inline all functions. For example, if the code in question is Linux kernel code and you're using the BUG_ON or WARN_ON statement to print a stack trace, you don't get the full stack trace which includes the inline function. Instead, the stack trace contains only the name of the calling function.
And, as the other answer explained, the "inline" doesn't actually force the compiler to inline the function, it just is a hint to the compiler. However, there is actually an attribute __attribute__((always_inline)) in GCC which should force the compiler to inline the function.
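A minimal sketch of how such an attribute is usually wrapped so the same source builds on more than one compiler. FORCE_INLINE, clampToByte, and sumClamped are hypothetical names for this example. The GCC spelling is the attribute named above, and the MSVC branch uses __forceinline.

```cpp
// Hypothetical portability macro; the name is made up for this example.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline  // plain hint on other compilers
#endif

// Small helper we insist on expanding at the call site.
FORCE_INLINE int clampToByte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// The single caller of the helper.
int sumClamped(const int* values, int count) {
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += clampToByte(values[i]);
    }
    return sum;
}
```

Checking the generated assembly remains the only way to confirm what a particular compiler actually did.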
Make sure that the function definition is not exported. If it is, it obviously needs to be compiled, and that means that if your function is big probably the call will not be inlined. (Remember, it's the call that gets inlined, not the function. A function might get inlined in one place and called in another, etc.)
So even if you know that the function is called only from one place, the compiler might not. Make sure to hide the definition of your function to the other object files, for example by defining it in the anonymous namespace.
That being said, even if it is called from only one place, it does not mean that it is always a good idea to inline it. If your function is called rarely, it might waste a lot of memory in the CPU cache.
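The anonymous-namespace suggestion can be sketched like this. The function names are hypothetical. Giving the helper internal linkage tells the compiler that no other object file can call it, so it can see that this translation unit contains the only call.

```cpp
namespace {

// Internal linkage: not visible to other object files, so the compiler
// knows every call site is in this translation unit.
int helperCalledOnce(int x) {
    return x * x + 1;
}

}  // namespace

// The only caller of the helper.
int publicEntryPoint(int x) {
    return helperCalledOnce(x) * 2;
}
```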
Depending on how you wrote your function.
In some cases, yes!
void doSomething(int *src, int *dst,
const int loopCounterInner, const int loopCounterOuter)
{
int i, j;
for(i=0; i<loopCounterOuter; i++){
for(j=0; j<loopCounterInner; j++){
*dst = someCalculations(*src);
src++;
dst++;
}
}
}
In this example, if this function is compiled as non-inlined, then compiler basically has no knowledge about the trip count of the two loops. This is a big deal for implementations that rely strongly on compile-time optimizations.
I came across an even worse case: the compiler assumes loopCounterInner to be a large value and optimizes for that case, but loopCounterInner is actually 3 or 5, so the best choice is to fully unroll the inner loop!
For C++ probably the best way to do it is to make them template variables, but for C, the only way to generate differently optimized code for different use cases is to inline the function.
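A rough sketch of the template approach, assuming the inner trip count is known at compile time. The name doSomethingFixedInner is hypothetical, and someCalculations is a made-up placeholder body. With the count as a template parameter, each instantiation gives the optimizer a constant bound it can unroll against.

```cpp
// Placeholder for the real per-element work (hypothetical).
inline int someCalculations(int v) {
    return v * 2 + 1;
}

// The inner trip count is a template parameter, so it is a compile-time
// constant inside each instantiation.
template <int LoopCounterInner>
void doSomethingFixedInner(const int* src, int* dst, int loopCounterOuter) {
    for (int i = 0; i < loopCounterOuter; ++i) {
        for (int j = 0; j < LoopCounterInner; ++j) {
            *dst = someCalculations(*src);
            ++src;
            ++dst;
        }
    }
}
```

A call such as doSomethingFixedInner<3>(src, dst, rows) then gets its own copy of the code specialised for an inner count of 3.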
No. If the code is a rarely used function, then keeping it off the 'hot path' will be beneficial. An inlined function will use up instruction cache space whether or not the code is actually used. Tools like LTCG combined with Profile Guided Optimisation (in the MSFT world, not sure about Linux) go to great pains to keep rarely used code off the hot path, and this can make a significant difference.

Micro optimization - compiler optimization when accessing recursive members

I'm interested in writing good code from the beginning instead of optimizing the code later. Sorry for not providing a benchmark; I don't have a working scenario at the moment. Thanks for your attention!
What are the performance gains of using FunctionY over FunctionX?
There is a lot of discussion on Stack Overflow about this already, but I'm in doubt about the case of accessing nested sub-members, as shown below. Will the compiler (say VS2008) optimize FunctionX into something like FunctionY?
void FunctionX(Obj * pObj)
{
pObj->MemberQ->MemberW->MemberA.function1();
pObj->MemberQ->MemberW->MemberA.function2();
pObj->MemberQ->MemberW->MemberB.function1();
pObj->MemberQ->MemberW->MemberB.function2();
..
pObj->MemberQ->MemberW->MemberZ.function1();
pObj->MemberQ->MemberW->MemberZ.function2();
}
void FunctionY(Obj * pObj)
{
W * localPtr = pObj->MemberQ->MemberW;
localPtr->MemberA.function1();
localPtr->MemberA.function2();
localPtr->MemberB.function1();
localPtr->MemberB.function2();
...
localPtr->MemberZ.function1();
localPtr->MemberZ.function2();
}
If none of the member pointers are volatile or pointers to volatile, and operator -> is not overloaded for any member in the chain, both functions are the same.
The optimization you suggested is widely known as Common Subexpression Elimination and has been supported by the vast majority of compilers for decades.
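One further caveat, added here as an assumption beyond what the answer states: the compiler may only reuse the loaded pointer if it can prove that the calls in between do not change the chain. The sketch below uses a deliberately contrived function1() that reseats MemberW through a hypothetical back-pointer. With such a member function, FunctionX and FunctionY genuinely behave differently, so no compiler may turn one into the other.

```cpp
struct W;
struct Q { W* MemberW; };
struct Obj { Q* MemberQ; };

struct A {
    int calls = 0;
    Q* owner = nullptr;   // hypothetical back-pointer, for the example only
    W* other = nullptr;
    // Contrived side effect: reseats the pointer chain.
    void function1() { ++calls; if (owner) owner->MemberW = other; }
    void function2() { ++calls; }
};

struct W { A MemberA; };

// Re-reads the chain before every call.
void FunctionX(Obj* pObj) {
    pObj->MemberQ->MemberW->MemberA.function1();
    pObj->MemberQ->MemberW->MemberA.function2();
}

// Reads the chain once and keeps using the old W.
void FunctionY(Obj* pObj) {
    W* localPtr = pObj->MemberQ->MemberW;
    localPtr->MemberA.function1();
    localPtr->MemberA.function2();
}
```

In ordinary code where the member functions leave the chain alone and are visible to the optimizer, the two forms compile to the same thing.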
In theory, you save on the extra pointer dereferences, HOWEVER, in the real world, the compiler will probably optimize it out for you, so it's a useless optimization.
This is why it's important to profile first, and then optimize later. The compiler is doing everything it can to help you, you might as well make sure you're not just doing something it's already doing.
If the compiler is good enough, it should translate FunctionX into something similar to FunctionY.
But you can get different results with different compilers, and with the same compiler using different optimization flags.
With a "dumb" compiler FunctionY should be faster, and IMHO it is more readable and faster to write. So stick with FunctionY.
P.S. You should take a look at a code style guide; normally member and function names start with a lower-case letter.

can compiler reorganize instructions over sleep call?

Is there a difference depending on whether it is the first use of the variable or not? For example, are a and b treated differently?
void f(bool&a, bool& b)
{
...
a=false;
boost::this_thread::sleep...//1 sec sleep
a=true;
b=true;
...
}
EDIT: people asked why I want to know this.
1. I would like to have some way to tell the compiler not to optimize(swap the order of the execution of the instructions) in some function, and using atomic and or mutexes is much more complicated than using sleep(and in my case sleeping is not a performance problem).
2. Like I said this is generally important to know.
We can't really tell. One scenario could be that the compiler has full visibility into your function at the call site (and possibly inlines it), in which case it can merge your function with the caller and then optimize accordingly.
It could then e.g. completely optimize away a and b because there is no code that depends on a and b. Or it might see that you violate aliasing rules so that a and b refer to the same entity, and then merge them according to your program flow.
But it could also be that you tell the compiler to not optimize at all, e.g. with g++'s -O0 flag, in which case not much will happen.
The only proof for your particular platform *, can be made by looking at the generated assembly, or by telling the compiler to please output some log about what it optimizes (g++ has many flags for that).
* compiler+flags used to compile compiler+version+add-ons, hardware, operating system; even the weather might be relevant if your compiler omits some optimizations when they take too long [which would actually be a cool feature for debug builds, imho]
They are not local (because they are references), so it can't, because it can't tell whether the called function sees them or not and has to assume that it does. If they were local variables, it could, because local variables are not visible to the called function unless pointer or reference to them was created.
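Since a sleep call is not a documented ordering guarantee, a hedged alternative is to make the flags atomic, which the question mentions but considers too complicated. The sketch below assumes a C++11 compiler and a rewritten signature for f(); the asker's real code is not known. With the default sequentially consistent stores, other threads cannot observe the writes to a and b out of order.

```cpp
#include <atomic>

// Hypothetical rewrite of f() from the question using C++11 atomics.
// store() defaults to std::memory_order_seq_cst, so the three stores
// below are observed by other threads in program order.
void f(std::atomic<bool>& a, std::atomic<bool>& b) {
    a.store(false);
    // ... the one-second sleep from the question would go here ...
    a.store(true);
    b.store(true);
}
```

For two flags this is not much more code than the original, and it does not depend on how a particular compiler treats the sleep call.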

Two questions about inline functions in C++

I have a question about compiling an inline function in C++.
Can a recursive function work with inline? If yes, please describe how.
I am sure about loop can't work with it but I have read somewhere recursive would work, If we pass constant values.
My friend sent me an inline recursive function called with a constant argument and told me it would work, but it does not work on my laptop: there is no error at compile time, but at run time it displays nothing and I have to terminate it with a forced break.
inline f(int n) {
if(n<=1)
return 1;
else {
n=n*f(n-1);
return n;
}
}
how does this work?
I am using turbo 3.2
Also, if an inline function code is too large then, can the compiler change it automatically in normal function?
thanks
This particular function definitely can be inlined. That is because the compiler can figure out that this particular form of recursion can be turned into a normal loop (it is not strictly tail recursion, since the multiplication happens after the recursive call, but optimizers can rewrite it with an accumulator). And with a normal loop it has no problem inlining it at all.
Not only can the compiler inline it, it can even calculate the result for a compile-time constant without generating any code for the function.
With GCC 4.4
int fac = f(10);
produced this instruction:
movl $3628800, 4(%esp)
You can easily verify when checking assembly output, that the function is indeed inlined for input that is not known at compile-time.
I suppose your friend was trying to say that if given a constant, the compiler could calculate the result entirely at compile time and just inline the answer at the call site. c++0x actually has a mechanism for this called constexpr, but there are limits to how complex the code is allowed to be. But even with the current version of c++, it is possible. It depends entirely on the compiler.
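For reference, a minimal sketch of the constexpr mechanism, assuming a C++11 compiler (the asker's compiler predates it). factorialAtRunTime is a hypothetical wrapper added only to show that the same function also works with run-time arguments.

```cpp
// C++11 constexpr: a single return statement keeps this valid even under
// the original, most restrictive constexpr rules.
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// Checked by the compiler during translation; nothing is computed at run time here.
static_assert(factorial(10) == 3628800, "10! should be 3628800");

// Hypothetical wrapper: with a run-time argument this is an ordinary call.
int factorialAtRunTime(int n) {
    return factorial(n);
}
```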
This function may be a good candidate given that it clearly only references the parameter to calculate the result. Some compilers even have non-portable attributes to help the compiler decide this. For example, gcc has pure and const attributes (listed on that page I just linked) that inform the compiler that this code only operates on the parameters and has no side effects, making it more likely to be calculated at compile time.
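A short sketch of the GCC const attribute mentioned above. ATTR_CONST, square, and sumOfTwoSquares are hypothetical names for this example. The macro expands to nothing on compilers that do not define __GNUC__, so the code still builds there, just without the hint.

```cpp
// Hypothetical wrapper macro around the GCC-specific attribute.
#if defined(__GNUC__)
#  define ATTR_CONST __attribute__((const))
#else
#  define ATTR_CONST
#endif

// Promise to the compiler: the result depends only on the argument and
// the function has no side effects.
ATTR_CONST int square(int x);

int square(int x) {
    return x * x;
}

// With the attribute, the optimizer is allowed to evaluate square(x) once
// and reuse the result for both operands.
int sumOfTwoSquares(int x) {
    return square(x) + square(x);
}
```

The attribute is a promise made by the programmer; if the function does read global state or have side effects, the resulting behaviour is not what the source appears to say.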
Even without this, it will still compile! The reason why is that the compiler is allowed to not inline a function if it decides. Think of the inline keyword more of a suggestion than an instruction.
Assuming that the compiler doesn't calculate the whole thing at compile time, inlining is not completely possible without other optimizations applied (see EDIT below) since it must have an actual function to call. However, it may get partially inlined. In that case the compiler will inline the initial call, but also emit a regular version of the function which will get called during recursion.
As for your second question, yes, size is one of the factors that compilers use to decide if it is appropriate to inline something.
If running this code on your laptop takes a very long time, then it is possible that you just gave it very large values and it is simply taking a long time to calculate the answer... The code looks OK, but keep in mind that 13! and above are going to overflow a 32-bit int. What value did you attempt to pass?
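The overflow limit can be checked with a small sketch that computes the factorial in 64 bits and compares it with the 32-bit range. factorial64 and fitsInInt32 are hypothetical helper names.

```cpp
#include <cstdint>
#include <limits>

// Computes n! in 64 bits so the 32-bit limit can be examined safely.
std::int64_t factorial64(int n) {
    std::int64_t result = 1;
    for (int i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

// True if the value is representable as a signed 32-bit integer.
bool fitsInInt32(std::int64_t v) {
    return v >= std::numeric_limits<std::int32_t>::min() &&
           v <= std::numeric_limits<std::int32_t>::max();
}
```

12! is 479001600 and still fits; 13! is 6227020800 and does not.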
The only way to know what actually happens is to compile it and look at the assembly generated.
PS: you may want to look into a more modern compiler if you are concerned with optimizations. For Windows there is MinGW and free versions of Visual C++. For *NIX there is of course g++.
EDIT: There is also a thing called Tail Recursion Optimization which allows compilers to convert certain types of recursive algorithms to iterative, making them better candidates for inlining. (In addition to making them more stack space efficient).
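To show what that rewrite amounts to, here is a sketch with the factorial written in accumulator form, where the recursive call is the last thing the function does, next to the loop it is equivalent to. Both function names are hypothetical, and whether a given compiler performs the conversion depends on its optimization settings.

```cpp
// Accumulator form: the recursive call is the very last action, so no
// work remains in the caller's frame after it returns.
unsigned long long factorialTail(unsigned n, unsigned long long acc = 1) {
    return n <= 1 ? acc : factorialTail(n - 1, acc * n);
}

// The iterative form a tail-recursion optimization effectively produces.
unsigned long long factorialLoop(unsigned n) {
    unsigned long long acc = 1;
    while (n > 1) {
        acc *= n;
        --n;
    }
    return acc;
}
```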
Recursive function can be inlined to certain limited depth of recursion. Some compilers have an option that lets you to specify how deep you want to go when inlining recursive functions. Basically, the compiler "flattens" several nested levels of recursion. If the execution reaches the end of "flattened" code, the code calls itself in usual recursive fashion and so on. Of course, if the depth of recursion is a run-time value, the compiler has to check the corresponding condition every time before executing each original recursive step inside the "flattened" code. In other words, there's nothing too unusual about inlining a recursive function. It is like unrolling a loop. There's no requirement for the parameters to be constant.
What you mean by "I am sure about loop can't work" is not clear. It doesn't seem to make much sense. Functions with a loop can be easily inlined and there's nothing strange about it.
What are you trying to say about your example that "displays nothing" is not clear either. There is nothing in the code that would "display" anything. No wonder it "displays nothing". On top of that, you posted invalid code. C++ language does not allow function declarations without an explicit return type.
As for your last question, yes, the compiler is completely free to implement an inline function as "normal" function. It has nothing to do with function being "too large" though. It has everything to do with more-or-less complex heuristic criteria used by that specific compiler to make the decision about inlining a function. It can take the size into account. It can take other things into account.
You can inline recursive functions. The compiler normally unrolls them to a certain depth (in VS you can even use a pragma for this), and the compiler can also do partial inlining. It essentially converts it into loops. Also, as @Evan Teran said, the compiler is not forced to inline a function that you suggest at all. It might totally ignore you and that's perfectly valid.
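For the Visual C++ pragmas referred to here, a guarded sketch might look like the following. The depth of 8 is an arbitrary choice for illustration, factorialOfFive is a hypothetical caller, and the pragmas are skipped on other compilers.

```cpp
#if defined(_MSC_VER)
#  pragma inline_recursion(on)  // let MSVC expand recursive calls inline
#  pragma inline_depth(8)       // up to 8 nested levels (arbitrary value)
#endif

// The factorial from the question, with the missing return type added.
inline int f(int n) {
    return n <= 1 ? 1 : n * f(n - 1);
}

int factorialOfFive() {
    return f(5);
}
```

These settings only widen what the compiler is allowed to expand; it still decides for itself at each call site.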
The problem with the code is not in that inline function. The constantness or not of the argument is pretty irrelevant, I'm sure.
Also, seriously, get a new compiler. There's modern free compilers for whatever OS your laptop runs.
One thing to keep in mind - according to the standard, inline is a suggestion, not an absolute guarantee. In the case of a recursive function, the compiler would not always be able to compute the recursion limit - modern compilers are getting extremely smart, a previous response shows the compiler evaluating a constant inline and simply generating the result, but consider
bigint fac = factorialOf(userInput)
there's no way the compiler can figure that one out........
As a side note, most compilers tend to ignore inlines in debug builds unless specifically instructed not to do so - makes debugging easier
Tail recursions can be converted to loops as long as the compiler can satisfactorily rearrange the internal representation to get the recursion conditional test at the end. In this case it can do the code generation to re-express the recursive function as a simple loop
As far as issues like tail recursion rewrites, partial expansions of recursive functions, etc., these are usually controlled by the optimization switches. All modern compilers are capable of pretty significant optimization, but sometimes things do go wrong.
Remember that the inline keyword merely sends a request, not a command, to the compiler. The compiler may ignore this request if the function definition is too long or too complicated, and compile the function as a normal function.
Some of the cases where inlining may not work are:
For functions returning values, if a loop, a switch or a goto exists.
For functions not returning values, if a return statement exists.
If the function contains static variables.
If inline functions are recursive.
Hence in C++ inline recursive functions may not work.

When should I use __forceinline instead of inline?

Visual Studio includes support for __forceinline. The Microsoft Visual Studio 2005 documentation states:
The __forceinline keyword overrides
the cost/benefit analysis and relies
on the judgment of the programmer
instead.
This raises the question: When is the compiler's cost/benefit analysis wrong? And, how am I supposed to know that it's wrong?
In what scenario is it assumed that I know better than my compiler on this issue?
You know better than the compiler only when your profiling data tells you so.
The one place I am using it is licence verification.
One important factor to protect against easy* cracking is to verify being licenced in multiple places rather than only one, and you don't want these places to be the same function call.
*) Please don't turn this in a discussion that everything can be cracked - I know. Also, this alone does not help much.
The compiler is making its decisions based on static code analysis, whereas if you profile as don says, you are carrying out a dynamic analysis that can be much farther reaching. The number of calls to a specific piece of code is often largely determined by the context in which it is used, e.g. the data. Profiling a typical set of use cases will do this. Personally, I gather this information by enabling profiling on my automated regression tests. In addition to forcing inlines, I have unrolled loops and carried out other manual optimizations on the basis of such data, to good effect. It is also imperative to profile again afterwards, as sometimes your best efforts can actually lead to decreased performance. Again, automation makes this a lot less painful.
More often than not though, in my experience, tweaking algorithms gives much better results than straight code optimization.
I've developed software for limited resource devices for 9 years or so and the only time I've ever seen the need to use __forceinline was in a tight loop where a camera driver needed to copy pixel data from a capture buffer to the device screen. There we could clearly see that the cost of a specific function call really hogged the overlay drawing performance.
The only way to be sure is to measure performance with and without. Unless you are writing highly performance critical code, this will usually be unnecessary.
For SIMD code.
SIMD code often uses constants/magic numbers. In a regular function, every const __m128 c = _mm_setr_ps(1,2,3,4); becomes a memory reference.
With __forceinline, compiler can load it once and reuse the value, unless your code exhausts registers (usually 16).
CPU caches are great but registers are still faster.
P.S. Just got 12% performance improvement by __forceinline alone.
The inline directive will be of no use at all when used for functions which are:
recursive,
long,
composed of loops.
If you want to force this decision, use __forceinline.
Actually, even with the __forceinline keyword, Visual C++ sometimes chooses not to inline the code. (Source: the resulting assembly code.)
Always look at the resulting assembly code where speed is of importance (such as tight inner loops needed to be run on each frame).
Sometimes using #define instead of inline will do the trick. (Of course you lose a lot of checking by using #define, so use it only when and where it really matters.)
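One concrete example of what is lost with #define: a macro argument is pasted in textually, so an argument with side effects is evaluated once per use. The names below are hypothetical and only illustrate the difference between the two forms.

```cpp
// Hypothetical macro: the argument text appears twice in the expansion.
#define SQUARE_MACRO(x) ((x) * (x))

// Inline function: the argument is evaluated exactly once.
inline int squareInline(int x) {
    return x * x;
}

static int g_counter = 0;

// Side effect: each call bumps the counter and returns the new value.
int nextValue() {
    return ++g_counter;
}

// Expands to ((nextValue()) * (nextValue())): two calls, yielding 1 and 2.
int viaMacro() {
    g_counter = 0;
    return SQUARE_MACRO(nextValue());
}

// A single call to nextValue(), then 1 * 1.
int viaInline() {
    g_counter = 0;
    return squareInline(nextValue());
}
```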
Actually, boost is loaded with it.
For example
BOOST_CONTAINER_FORCEINLINE flat_tree& operator=(BOOST_RV_REF(flat_tree) x)
BOOST_NOEXCEPT_IF( (allocator_traits_type::propagate_on_container_move_assignment::value ||
allocator_traits_type::is_always_equal::value) &&
boost::container::container_detail::is_nothrow_move_assignable<Compare>::value)
{ m_data = boost::move(x.m_data); return *this; }
BOOST_CONTAINER_FORCEINLINE const value_compare &priv_value_comp() const
{ return static_cast<const value_compare &>(this->m_data); }
BOOST_CONTAINER_FORCEINLINE value_compare &priv_value_comp()
{ return static_cast<value_compare &>(this->m_data); }
BOOST_CONTAINER_FORCEINLINE const key_compare &priv_key_comp() const
{ return this->priv_value_comp().get_comp(); }
BOOST_CONTAINER_FORCEINLINE key_compare &priv_key_comp()
{ return this->priv_value_comp().get_comp(); }
public:
// accessors:
BOOST_CONTAINER_FORCEINLINE Compare key_comp() const
{ return this->m_data.get_comp(); }
BOOST_CONTAINER_FORCEINLINE value_compare value_comp() const
{ return this->m_data; }
BOOST_CONTAINER_FORCEINLINE allocator_type get_allocator() const
{ return this->m_data.m_vect.get_allocator(); }
BOOST_CONTAINER_FORCEINLINE const stored_allocator_type &get_stored_allocator() const
{ return this->m_data.m_vect.get_stored_allocator(); }
BOOST_CONTAINER_FORCEINLINE stored_allocator_type &get_stored_allocator()
{ return this->m_data.m_vect.get_stored_allocator(); }
BOOST_CONTAINER_FORCEINLINE iterator begin()
{ return this->m_data.m_vect.begin(); }
BOOST_CONTAINER_FORCEINLINE const_iterator begin() const
{ return this->cbegin(); }
BOOST_CONTAINER_FORCEINLINE const_iterator cbegin() const
{ return this->m_data.m_vect.begin(); }
There are several situations where the compiler is not able to determine categorically whether it is appropriate or beneficial to inline a function. Inlining may involve trade-offs that the compiler is unwilling to make, but you are (e.g., code bloat).
In general, modern compilers are actually pretty good at making this decision.
When you know that the function is going to be called in one place several times for a complicated calculation, then it is a good idea to use __forceinline. For instance, a matrix multiplication for animation may need to be called so many times that the calls to the function will start to be noticed by your profiler. As said by the others, the compiler can't really know about that, especially in a dynamic situation where the execution of the code is unknown at compile time.
A Case For noinline
I wanted to pitch in with an unusual suggestion and actually vouch for __declspec(noinline) in MSVC or the noinline attribute/pragma in GCC and ICC as an alternative to try out first over __forceinline and its equivalents when staring at profiler hotspots. YMMV but I've gotten so much more mileage (measured improvements) out of telling the compiler what to never inline than what to always inline. It also tends to be far less invasive and can produce much more predictable and understandable hotspots when profiling the changes.
While it might seem very counter-intuitive and somewhat backward to try to improve performance by telling the compiler what not to inline, I'd claim based on my experience that it's much more harmonious with how optimizing compilers work and far less invasive to their code generation. A detail to keep in mind that's easy to forget is this:
Inlining a callee can often result in the caller, or the caller of the caller, ceasing to be inlined.
This is what makes force inlining a rather invasive change to the code generation that can have chaotic results on your profiling sessions. I've even had cases where force inlining a function reused in several places completely reshuffled all top ten hotspots with the highest self-samples all over the place in very confusing ways. Sometimes it got to the point where I felt like I'm fighting with the optimizer making one thing faster here only to exchange a slowdown elsewhere in an equally common use case, especially in tricky cases for optimizers like bytecode interpretation. I've found noinline approaches so much easier to use successfully to eradicate a hotspot without exchanging one for another elsewhere.
It would be possible to inline functions much less invasively if we
could inline at the call site instead of determining whether or not
every single call to a function should be inlined. Unfortunately, I've
not found many compilers supporting such a feature besides ICC. It
makes much more sense to me if we are reacting to a hotspot to respond
by inlining at the call site instead of making every single call of a
particular function forcefully inlined. Lacking this wide support
among most compilers, I've gotten far more successful results with
noinline.
Optimizing With noinline
So the idea of optimizing with noinline is still with the same goal in mind: to help the optimizer inline our most critical functions. The difference is that instead of trying to tell the compiler what they are by forcefully inlining them, we are doing the opposite and telling the compiler what functions definitely aren't part of the critical execution path by forcefully preventing them from being inlined. We are focusing on identifying the rare-case non-critical paths while leaving the compiler still free to determine what to inline in the critical paths.
Say you have a loop that executes for a million iterations, and there is a function called baz which is only very rarely called in that loop once every few thousand iterations on average in response to very unusual user inputs even though it only has 5 lines of code and no complex expressions. You've already profiled this code and the profiler shows in the disassembly that calling a function foo which then calls baz has the largest number of samples with lots of samples distributed around calling instructions. The natural temptation might be to force inline foo. I would suggest instead to try marking baz as noinline and time the results. I've managed to make certain critical loops execute 3 times faster this way.
Analyzing the resulting assembly, the speedup came from the foo function now being inlined as a result of no longer inlining baz calls into its body.
I've often found in cases like these that marking the analogical baz with noinline produces even bigger improvements than force inlining foo. I'm not a computer architecture wizard to understand precisely why but glancing at the disassembly and the distribution of samples in the profiler in such cases, the result of force inlining foo was that the compiler was still inlining the rarely-executed baz on top of foo, making foo more bloated than necessary by still inlining rare-case function calls. By simply marking baz with noinline, we allow foo to be inlined when it wasn't before without actually also inlining baz. Why the extra code resulting from inlining baz as well slowed down the overall function is still not something I understand precisely; in my experience, jump instructions to more distant paths of code always seemed to take more time than closer jumps, but I'm at a loss as to why (maybe something to do with the jump instructions taking more time with larger operands or something to do with the instruction cache). What I can definitely say for sure is that favoring noinline in such cases offered superior performance to force inlining and also didn't have such disruptive results on the subsequent profiling sessions.
So anyway, I'd suggest to give noinline a try instead and reach for it first before force inlining.
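A minimal sketch of the foo/baz arrangement described above. The NOINLINE macro name, the function bodies, and the 1-in-4096 trigger are hypothetical; only the shape (a hot foo that rarely calls a small baz) comes from the text. Whether this is faster in a given program is something to measure, not assume.

```cpp
// Hypothetical portability macro for "never inline this function".
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#elif defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#else
#  define NOINLINE
#endif

static long g_rareEvents = 0;

// Rare path: small, but deliberately kept out of line so that it does
// not bloat foo() when foo() is inlined into the loop.
NOINLINE void baz(int amount) {
    g_rareEvents += amount;
}

// Hot path: cheap, left to the compiler's own inlining decision.
inline int foo(int i) {
    if (i % 4096 == 0) {
        baz(1);  // taken roughly once every few thousand iterations
    }
    return i & 1;
}

// The critical loop.
long runLoop(int iterations) {
    long sum = 0;
    for (int i = 0; i < iterations; ++i) {
        sum += foo(i);
    }
    return sum;
}
```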
Human vs. Optimizer
In what scenario is it assumed that I know better than my compiler on
this issue?
I'd refrain from being so bold as to assume. At least I'm not good enough to do that. If anything, I've learned over the years the humbling fact that my assumptions are often wrong once I check and measure things I try with the profiler. I have gotten past the stage (over a couple of decades of making my profiler my best friend) of taking completely blind stabs in the dark only to face humbling defeat and revert my changes, but at my best, I'm still making, at most, educated guesses. Still, I've always known better than my compiler, and hopefully, most of us programmers have always known this better than our compilers, how our product is supposed to be designed and how it is most likely going to be used by our customers. That at least gives us some edge in the understanding of common-case and rare-case branches of code that compilers don't possess (at least without PGO, and I've never gotten the best results with PGO). Compilers don't possess this type of runtime information and foresight of common-case user inputs. It is when I combine this user-end knowledge with a profiler in hand that I've found the biggest improvements, nudging the optimizer here and there by teaching it things like what to inline or, more commonly in my case, what to never inline.