C++ heuristic for estimating function inlining benefits

In C++, what is a good heuristic for estimating the compute-time benefit of inlining a function, particularly when the function is called very frequently and accounts for >= 10% of the program's execution time (e.g. the evaluation function of a brute-force or stochastic optimization process)? Even though inlining may ultimately be beyond my control, I am still curious.

There is no general answer. It depends on the hardware, the number and
type of its arguments, and what is done in the function. And how often
it is called, and where. On a Sparc, for example, arguments (and the
return value) are passed in registers, and each function gets 16 new
registers: if the function is complex enough, those new registers may
avoid spilling that would occur if the function were inlined, and the
non-inline version may end up faster than the inlined one. On a 32-bit
Intel, which is register poor and by default passes arguments on the
stack, just the opposite might be true, for the same function in the
same program. More
generally, inlining may increase program size, reducing locality. Or
for very simple functions, it may reduce program size; but that again
depends on the architecture. The only possible way to know is to try
both, measuring the time. And even then you'll only know for that
particular program, on that particular hardware.
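
A minimal sketch of the "try both and measure" approach described above, assuming a GCC/Clang or MSVC toolchain. The names eval_inline, eval_noinline, NOINLINE and time_ms are hypothetical, and the attribute used to suppress inlining is compiler-specific. The optimizer may still transform the loop in other ways, so treat the timings as indicative only.

```cpp
#include <chrono>
#include <cstdint>

// Compiler-specific way to forbid inlining (assumption: GCC/Clang or MSVC).
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// Hypothetical cheap evaluation function that the compiler is free to inline.
inline std::uint64_t eval_inline(std::uint64_t x) { return x * x + 1; }

// Same body, but the compiler is asked not to inline it.
NOINLINE std::uint64_t eval_noinline(std::uint64_t x) { return x * x + 1; }

// Calls f for every value in [0, n), stores the accumulated sum in `sum`
// so the work stays observable, and returns the elapsed time in milliseconds.
template <class F>
double time_ms(F f, std::uint64_t n, std::uint64_t& sum) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i) acc += f(i);
    auto stop = std::chrono::steady_clock::now();
    sum = acc;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping each candidate in a lambda, for example time_ms([](std::uint64_t x) { return eval_noinline(x); }, n, sum), keeps the call direct rather than going through a function pointer. The result only applies to that program, compiler, flag set and machine.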

A function call and return on some architectures take as few as one instruction each (although they're generally not RISC-like single-cycle instructions). In general, you can compare that to the number of cycles represented by the body of the function. A simple property access might be only a single instruction, so putting it into a non-inlined function will triple the number of instructions needed to execute it -- obviously a great candidate for inlining. On the other hand, a function that formats a string for printing might represent hundreds of instructions, so two more aren't going to make any difference at all.
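
As an illustration of the "simple property access" case, here is a small sketch. Particle and total_mass are hypothetical names. The accessor is defined inside the class, which makes it implicitly inline and visible to every caller, so a compiler can typically reduce each call to a single load.

```cpp
// Hypothetical class with a one-instruction "property access".
class Particle {
public:
    explicit Particle(double mass) : mass_(mass) {}

    // Defined inside the class, so it is implicitly inline and its body
    // is visible at every call site.
    double mass() const { return mass_; }

private:
    double mass_;
};

// Sums the masses of n particles; if mass() is inlined, the loop body
// contains no call or return instructions at all.
double total_mass(const Particle* p, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += p[i].mass();
    return sum;
}
```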

If your bottleneck is in a recursive function, and assuming that the level of recursion is not minimal (i.e. average recursion is not just a few levels), you are better off working on the algorithm in the function than on inlining.
Try, if possible, to transform the recursion into a loop or into tail recursion (which the compiler can implicitly transform into a loop), or try to determine where in the function the time is being spent. Try to minimize the impact of the internal operations (maybe you are dynamically allocating memory that could have automatic storage duration, or maybe you can factor out a common operation, perform it once in a wrapper, and pass the result in as an extra argument, ...).
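
A small sketch of that transformation, using a hypothetical sum-of-squares function (not taken from the question) to show the three forms side by side:

```cpp
#include <cstdint>

// Plain recursion: not a tail call, because the addition happens after
// the recursive call returns, so every level needs its own stack frame.
std::uint64_t sum_squares_recursive(std::uint64_t n) {
    return n == 0 ? 0 : n * n + sum_squares_recursive(n - 1);
}

// Tail recursion: the running total travels down in `acc`, so the
// recursive call is the last operation and a compiler may turn it into a loop.
std::uint64_t sum_squares_tail(std::uint64_t n, std::uint64_t acc = 0) {
    return n == 0 ? acc : sum_squares_tail(n - 1, acc + n * n);
}

// Explicit loop: no call overhead and no stack growth.
std::uint64_t sum_squares_loop(std::uint64_t n) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 1; i <= n; ++i) acc += i * i;
    return acc;
}
```

Whether the tail-recursive form is actually compiled to a loop depends on the compiler and optimization level; the explicit loop does not rely on that.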
*EDIT after the comment that recursion was not intended, but rather iteration*
If the compiler has access to the definition of the function, it will make the right decision for you in most cases. If it does not have access to the definition, just move the code around so that it does see it. Maybe make the function a static function to provide an extra hint that it won't be used anywhere else, or even mark it as inline (knowing that this will not force inlining), but avoid using special attributes that will force inlining, as the compiler probably does it better than any simple heuristic that can be produced without looking at the code.

All inlining saves you is the entry/exit cost of the function, so it's only worth considering if the function does almost nothing.
Certainly if the function itself contains a function call, it's probably not worth considering.
Even if the function does very little, it has to be called so much that it owns the program counter a significant percent of the time, before any speedup of the function would be noticeable.

The behaviour here is somewhat compiler dependent. With a recursive function, inlining could obviously in theory go on forever. The 'inline' keyword is only a hint to the compiler; it can choose to ignore it if it can't do anything with it. Some compilers will inline a recursive function to a certain depth.
As for 'how much will this speed things up': unfortunately we can't give a general answer, because it depends on how much work the function does compared with the overhead of the function call mechanism itself. Why don't you set up a test and see?

Our experience, from 20+ years of writing computationally intensive C++, is that inlining is no silver bullet. You really do need to profile your code to see whether inlining will increase performance. For us, except for low-level 2D and 3D point and vector manipulations, inlining is a waste of time. You are far better off working out a better algorithm than trying to micromanage clock ticks.

Related

Profiler tells that function call overhead is 10x of normal statement

I got a profile result stating that the overhead of calling a function is very large.
It is currently a bottleneck of my program.
The function is in a template class:
template<class U> class CustomArray{
    ....
public:
    U& operator[](int n){             //<-- 2.8%
        ... some cheap assertion ...  //<-- 0.2%
        return database()[n];         //<-- 0.3% (just add address to allocated mem)
    }                                 //<-- 2.7%
};
(^ The image was edited a little to protect me from my boss.)
Question
Is it possible? Is the profiler wrong?
If so, how can I optimize it?
I have tried the inline keyword (no difference). This function is already inline, isn't it?
I am using Visual Studio 2015's profiler (optimization -O2).
The result is very inconsistent with How much overhead is there in calling a function in C++?.
Edit: I confirm that Profiling Collection = Sampling (not Instrumentation).
Let's assume you are using the default sampling method of profiling in Visual Studio.
Such profilers usually work at the assembly level, for example, by sampling the current instruction pointer periodically. They then use debug data to try to map that data back to source lines. For heavily optimized and inlined code, this mapping isn't always reliable (indeed, some instructions may not originate from any single line, or may effectively be shared among several).
In addition to making profiling tricky, this also means statements like "function call has a 10x overhead of a normal statement" aren't really meaningful: there is no "typical" function call and there certainly is no typical "normal statement". Function calls can vary from totally free (when inlined or even eliminated) to somewhat expensive (mis-predicted virtual calls [1]), and statements span an even greater range, from free to almost unlimited in cost (a common case would be a cache miss taking hundreds of cycles).
On top of that, sampling methods often have inherent error or skew. For example, an expensive instruction may tend to spread its samples out among subsequent instructions rather than being assigned all the samples itself. This leads to additional error at the instruction level.
All this adds up to mean that while sampling results may be quite accurate for broad-stroke profiling (i.e., identifying features on the order of hundreds of cycles), you can't always read too much into very fine-grained results such as your one-line function above.
If you do want to read into those results, the first step is to see if the sampling mode has an assembly level view and to use that view, since at least then you remove entirely the assembly-to-source mapping issue.
[1] Is there anything worse that could reasonably be considered a "function call" in C++?

Is defining a probability distribution costly?

I'm coding a physics simulation and I now feel the need to optimize it. I'm thinking about improving one point: one of the methods of one of my classes (which I call billions of times in several cases) defines probability distributions every time it runs. Here is the code:
void myClass::myMethod(){ //called billions of times in several cases
    uniform_real_distribution<> probd(0,1);
    uniform_int_distribution<> probh(1,h-2);
    uniform_int_distribution<> probv(1,v-2);
    //rest of the code
}
Could I make the distributions members of the class so that I won't have to define them every time? And just initialize them in the constructor and redefine them when h and v change? Would that be a worthwhile optimization? And last question, could it be something that is already taken care of by the compiler (g++ in my case) when compiling with the -O3 or -O2 flag?
Thank you in advance!
Update: I coded it and timed both: the program actually turned out a bit slower (a few percent), so I'm back to what I started with: creating the probability distributions on every call.
Answer A: I shouldn't think so, for a uniform distribution it's just going to copy the parameter values into place, maybe with a small amount of arithmetic, and that will be well optimized.
However, I believe distribution objects can have state. They can use part of the random data from a call to the generator, and are permitted to save the rest of the randomness for the next time the distribution is used, in order to reduce the total number of calls to the generator. So when you destroy a distribution object you might be discarding some possibly-costly random data.
Answer B: stop guessing and test it.
Time your code, then add static to the definition of probd and time it again.
1. Yes
2. Yes
3. Well, there may be some advantage, but AFAIK those objects aren't really heavyweight/expensive to construct. Also, with locals you may gain something in data locality and in assumptions the optimizer can make.
4. I don't think they are automatically moved to class members (especially if your class is POD - in that case I doubt the compiler will dare to modify its layout); most probably, instead, they are completely optimized away - only the code of the called methods - in particular operator() - may remain, referring directly to h and v. But this must be checked by looking at the generated assembly.
Incidentally, if you have a performance problem, besides optimizing obvious points (non-optimal algorithms used in inner loops, continuous memory allocations, removing useless copies of big objects, ...) you should really try to use a profiler to find the real "hot spots" in your code, and concentrate on optimizing them instead of going randomly through all the code.
uniform_real_distribution maintains a state of type param_type which is two double values (using default template parameters). The constructor assigns to these and is otherwise trivial, the destructor is trivial.
Therefore, constructing a temporary within your function has an overhead of storing 2 double values, as compared to initializing 1 pointer (or reference) or going through an indirection via this. In theory, it might therefore be faster (though what appears to be faster, or what would make sense to run faster, isn't necessarily any faster). Since it's not much work, it's certainly worth trying and timing whether there's a difference, even if it is a micro-optimization.
Some 3-4 extra cycles are normally negligible, but since you're saying "billions of times" it may of course very well make a measurable difference. 3 cycles times one billion is 1 second on a 3GHz machine.
Of course, optimization without profiling is always somewhat... awkward. You might very well find that a different part in your code that's called billions of times saves a lot more cycles.
EDIT:
Since you're not going to modify it, and since the first distribution is initialized with literal values, you might actually make it a constant (such as a constexpr or namespace level static const). That should, regardless of the other two, allow the compiler to generate the most efficient code in any case for that one.
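
To make the member-versus-local comparison concrete, here is a sketch under stated assumptions. Sim, resize, row_member, col_member, row_local and unit are hypothetical names, and std::mt19937 is an assumed generator (the question does not say which one is used). Only the distribution types and the h-2 / v-2 bounds come from the question.

```cpp
#include <random>

// Hypothetical stand-in for the asker's class; h and v are the dimensions
// that bound the two integer distributions.
class Sim {
public:
    Sim(int h, int v, unsigned seed)
        : gen_(seed), probd_(0.0, 1.0), probh_(1, h - 2), probv_(1, v - 2) {}

    // Called when h or v changes: only the stored parameters are replaced.
    void resize(int h, int v) {
        using Param = std::uniform_int_distribution<>::param_type;
        probh_.param(Param(1, h - 2));
        probv_.param(Param(1, v - 2));
    }

    // Variant A: distributions are members, constructed once.
    double unit() { return probd_(gen_); }
    int row_member() { return probh_(gen_); }
    int col_member() { return probv_(gen_); }

    // Variant B: the distribution is a local, constructed on every call,
    // as in the original code.
    int row_local(int h) {
        std::uniform_int_distribution<> probh(1, h - 2);
        return probh(gen_);
    }

private:
    std::mt19937 gen_;                        // assumed generator
    std::uniform_real_distribution<> probd_;
    std::uniform_int_distribution<> probh_;
    std::uniform_int_distribution<> probv_;
};
```

Timing row_member() against row_local(h) in the real program is the only way to tell which is faster there; the asker's own update reports the member version being slightly slower.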

Loops - storing data or recalculating

How much time does saving a value cost me processor-wise? Say I have a calculated value x that I will use 2 times, 5 times, or 20 times. At what point does it become better to save the calculated value instead of recalculating it each time I use it?
example:
int a=0,b=-5;
for(int i=0;i<k;++i)
    a+=abs(b);
or
int a=0,b=-5;
int x=abs(b);
for(int i=0;i<k;++i)
    a+=x;
At what k value does the second scenario produce better results? Also, how much is this RAM dependent?
Since the value of abs(b) doesn't change inside the for loop, a compiler will most likely optimize both snippets to the same result i.e. evaluating the value of abs(b) just once.
It is almost impossible to provide an answer other than measure in a real scenario. When you cache the data in the code, it may be stored in a register (in the code you provide it will most probably be), or it might be flushed to L1 cache, or L2 cache... depending on what the loop is doing (how much data is it using?). If the value is cached in a register the cost is 0, the farther it is pushed the higher the cost it will take to retrieve the value.
In general, write code that is easy to read and maintain, then measure the performance of the application, and if that is not good, profile. Find the hotspots, find out why they are hotspots, and then work from there. I doubt that caching vs. calculating abs(b) for something like the above would ever be a hotspot in a real application. So don't sweat it.
I would suggest (this is without testing mind you) that the example with
int x=abs(b)
outside the loop will be faster simply because you're avoiding allocating a stack frame each iteration in order to call abs().
That being said, if the compiler is smart enough, it may figure out what you're doing and produce the same (or similar) instructions for both.
As a rule of thumb it doesn't cost you much, if anything, to store that value outside the loop, since the compiler is most likely going to store the result of abs(b) in a register anyway. In fact, when the compiler optimizes this code (assuming you have optimizations turned on), one of the first things it will do is pull that abs(b) out of the loop.
Historically you could also qualify the declaration of "x" with the "register" hint, which asks the compiler to keep x in a register if possible. That keyword was deprecated in C++11 and removed in C++17, and modern optimizers do their own register allocation regardless, so it no longer helps.
If you want to see what the compiler actually does with your code, one thing to do is to tell it to compile but not assemble (in gcc, the option is -S) and look at the resulting assembly code. In many cases, the compiler will generate better code than you can optimize by hand. However, there's also no reason to NOT do these easy optimizations yourself.
Addendum:
Compiling the above code with optimizations turned on in GCC will result in code equivalent to:
a = abs(b) * k;
Try it and see.
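
For reference, the three forms discussed above can be written as separate functions and compared in the generated assembly (for example with gcc -S). The function names are hypothetical, and the closed form is only equivalent when k is non-negative and the result does not overflow int.

```cpp
#include <cstdlib>

// Recomputes abs(b) on every iteration, as in the first snippet.
int sum_recompute(int b, int k) {
    int a = 0;
    for (int i = 0; i < k; ++i) a += std::abs(b);
    return a;
}

// Computes abs(b) once before the loop, as in the second snippet.
int sum_hoisted(int b, int k) {
    int a = 0;
    const int x = std::abs(b);
    for (int i = 0; i < k; ++i) a += x;
    return a;
}

// Closed form that an optimizer may derive from either loop
// (assumes no int overflow; a loop with k <= 0 adds nothing).
int sum_closed(int b, int k) {
    return k > 0 ? std::abs(b) * k : 0;
}
```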
For many cases it produces better perf from k=2. The example you gave is not one. Most compilers try to perform this kind of hoisting even when only low levels of optimization are enabled. The value is stored, at worst, on the local stack and so is likely to stay fairly cache warm, negating your memory concerns.
But potentially it will be held in a register.
The original has to perform an additional branch, repeat the calculation and return the value. Abs is one example of a function the compiler may be able to recognize as a constexpr and hoist.
While developing your own classes, this is one of the reasons you should try to mark members and references as const whenever possible.

Few questions about C++ inline functions

The training materials from the class I took seem to be making two conflicting statements.
On one hand:
"Use of inline functions usually results in faster execution"
On the other hand:
"Use of inline functions may decrease performance due to more frequent
swapping"
Question 1: Are both statements true?
Question 2: What is meant by "swapping" here?
Please glance at this snippet:
int powA(int a, int b) {
    return (a + b)*(a + b) ;
}
inline int powB(int a, int b) {
    return (a + b)*(a + b) ;
}
int main () {
    Timer *t = new Timer;
    for(int a = 0; a < 9000; ++a) {
        for(int b = 0; b < 9000; ++b) {
            int i = (a + b)*(a + b); // 322 ms <-----
            // int i = powA(a, b); // not inline : 450 ms
            // int i = powB(a, b); // inline : 469 ms
        }
    }
    double d = t->ms();
    cout << "--> " << d << endl;
    return 0;
}
Question 3: Why is performance so similar between powA and powB? I would have expected powB performance to be along 322ms, since it is, after all, inline.
Question 1
Yes, both statements can be true, in particular circumstances. Obviously they won't both be true at the same time.
Question 2
"Swapping" is likely a reference to OS paging behaviour, where pages are swapped out to disk when the memory pressure becomes high.
In practice, if your inline functions are small then you will usually notice a performance improvement due to eliminating the overhead of a function call and return. However, in very rare circumstances, you may cause code to grow such that it cannot completely reside inside the CPU cache (during a performance-critical tight loop), and you may experience decreased performance. However, if you're coding at that level then you probably should be coding directly in assembly language anyway.
Question 3
The inline modifier is a hint to the compiler that it might want to consider compiling the given function inline. It doesn't have to follow your directions, and the result may also depend on the given compiler options. You can always look at the generated assembly code to find out what it did.
Your benchmark may not even be doing what you want because your compiler might be smart enough to see that you're not even using the result of the function call that you assign into i, so it might not even bother to call your function. Again, look at the generated assembly code.
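
One way to keep the optimizer from discarding the work is to fold every result into a checksum that is returned to the caller. The sketch below assumes std::chrono in place of the question's Timer class (whose definition is not shown); checksum_loop is a hypothetical name.

```cpp
#include <chrono>
#include <cstdint>

int powA(int a, int b) { return (a + b) * (a + b); }
inline int powB(int a, int b) { return (a + b) * (a + b); }

// Runs f over an n-by-n grid, adds every result to a checksum so the calls
// have an observable effect, and reports the elapsed time in milliseconds.
template <class F>
std::uint64_t checksum_loop(F f, int n, double& elapsed_ms) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (int a = 0; a < n; ++a)
        for (int b = 0; b < n; ++b)
            sum += static_cast<std::uint64_t>(f(a, b));
    auto stop = std::chrono::steady_clock::now();
    elapsed_ms = std::chrono::duration<double, std::milli>(stop - start).count();
    return sum;
}
```

Printing or otherwise using the returned checksum is what keeps the loop alive. Passing a lambda such as [](int a, int b) { return powB(a, b); } keeps the call direct instead of going through a function pointer. The compiler may still simplify the loop, so the assembly remains the final authority.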
inline inserts the code at the call site, saving on creation of stack frame, saving/restoring registers and a call (branch). In other words, using inline (when it works) is similar to writing the code for inlined function in place of its call.
However, inline isn't guaranteed to do anything and is compiler-dependent. The compiler will sometimes inline functions that aren't inline (well, it's probably the linker that does that when link-time optimization is turned on, but it's easy to imagine situations when it can be done on compiler level - e.g. when the inlined function is static).
If you want to force MSVC to inline functions, use __forceinline and check the assembly. There should be no calls - your code should compile to simple sequence of instructions executed linearly.
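
A sketch of how such a request is commonly wrapped so the same source builds with other compilers. FORCE_INLINE, powC and sum_row are hypothetical names; __forceinline is the MSVC keyword mentioned above, and the always_inline attribute is the GCC/Clang counterpart with a similar intent.

```cpp
// Hypothetical portability macro for a "please really inline this" request.
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline  // fallback: ordinary hint only
#endif

FORCE_INLINE int powC(int a, int b) { return (a + b) * (a + b); }

// With powC inlined, the loop body should contain no call instruction;
// the generated assembly is the place to confirm that.
int sum_row(int a, int n) {
    int sum = 0;
    for (int b = 0; b < n; ++b) sum += powC(a, b);
    return sum;
}
```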
Regarding the speed: you can indeed make your code faster by inlining small functions. When you inline large functions however (and "Large" is hard to define, you need to run tests to determine what's large and what's not), your code size becomes bigger. That's because the code of the inlined function is repeated over and over again at the call sites. After all, the whole point of having a call to a function is to save the instruction count by reusing the same subroutine from multiple places in code.
When the code size becomes larger, the instruction caches may be overwhelmed, leading to slower code execution.
Another point to consider: modern out-of-order CPUs (most desktop CPUs, e.g. Intel Core Duo or i7) have a mechanism (instruction trace) to prefetch branches ahead and "inline" them at the hardware level. So aggressive inlining doesn't always make sense.
In your example, you need to see the assembly that your compiler generates. It may be the same for the inline and non-inline versions. If it doesn't inline, try __forceinline if it's MSVC that you're using. If the timing is the same in both cases, it means your CPU does a good job at prefetching instructions and the execution time bottleneck is elsewhere.
Swapping is an OS term about swapping different pages of memory in and out of the running process. Basically the swap takes some time. The bigger your app is, the more swapping it may have.
When you inline a function, instead of jumping to a single subroutine, a copy of the whole function is dumped at the calling location. This makes your program bigger, and hence in theory can lead to more swapping.
Normally for very small methods (like your powA and powB) inlining should be ok and result in faster execution, but it is really just "in theory" - there are probably "bigger fish to fry" in terms of squeezing the last drop of performance out of your code.
The book's statements are correct. In other words, when done properly, inline can improve performance, and when done improperly it can reduce performance.
It's best to only inline small functions. This removes the extra call and return jumps. This is how performance is improved.
If you inline large functions, the code can grow beyond what fits in the cache, and in the worst case cause additional memory swapping. This is how performance is hindered.
Both statements are true, sort of. Declaring a function inline is an indicator to the compiler to inline if able. The compiler will (usually) use its own judgment on whether or not to actually inline, but in C++ declaring it inline does change the code generation, at least for symbol generation.
"Swapping" in this context refers to paging the executable image to disk. Since the executable is larger, it may affect performance on memory-constrained systems.
Answering your third question, the compiler chose the same behavior (my guess is non-inline) for both functions.
When an ordinary function is compiled, its machine code is compiled once and put in one place, separate from the other functions that call it. When executing the code, the processor has to jump to the place where the code is stored, and this jump instruction takes extra time to load the function from memory. Sometimes, several jumps (or several loads and a jump) are needed to call a function, e.g. virtual functions. There is also time that is spent saving and restoring registers, and creating a stack frame, none of which is really necessary for sufficiently small inline functions.
When an inline function is compiled, all of its machine code is inserted directly into the place where it is called, so the time for the jump instruction is eliminated. The compiler also optimizes the code of the inline function based on its surroundings (e.g. register assignment can consider both the variables used outside the function and inside the function to minimize the number of registers that need to be saved). However, the inline function's code may appear in multiple places in the calling function (if it was called multiple times in the calling code), so on the whole it makes your codebase bigger. This can cause your code to grow large enough that it no longer fits in the CPU cache, in which case the processor has to go to main memory to fetch your code, and this takes longer than getting everything from cache. In some circumstances, this can offset the savings from eliminating the jump instruction, and make your code slower than if you had not inlined the code.
"Swapping" usually refers to the behavior of virtual memory, which has the same kinds of tradeoffs as the CPU cache, but the time it takes to load code from disk is much longer, and the amount of memory your program has to fill for this to come into play is much larger. You're unlikely to ever see inline functions affect virtual memory performance.
Obviously both effects don't happen at once but it's difficult to know which will apply in any given circumstance.

performance of std::vector c++ size() inside loop in member function

Similar question, but less specific:
Performance issue for vector::size() in a loop
Suppose we're in a member function like:
void Object::DoStuff() {
    for( int k = 0; k < (int)this->m_Array.size(); k++ )
    {
        this->SomeNotConstFunction();
        this->ConstFunction();
        double x = SomeExternalFunction(k);
    }
}
1) I'm willing to believe that if only the "SomeExternalFunction" is called that the compiler will optimize and not redundantly call size() on m_Array ... is this the case?
2) Wouldn't you almost certainly get a boost in speed from doing
int N = m_Array.size();
for( int k = 0; k < N; k++ ) { ... }
if you're calling some member function that is not const?
Edit Not sure where these down-votes and snide comments about micro-optimization are coming from, perhaps I can clarify:
Firstly, it's not to optimize per-se but just understand what the compiler will and will not fix. Usually I use the size() function but I ask now because here the array might have millions of data points.
Secondly, the situation is that "SomeNotConstFunction" might have a very rare chance of changing the size of the array, or its ability to do so might depend on some other variable being toggled. So, I'm asking at what point will the compiler fail, and what exactly is the time cost incurred by size() when the array really might change, despite human-known reasons that it won't?
Third, the operations in the loop are pretty trivial; there are just millions of them, but they are embarrassingly parallel. I would hope that hoisting the value out of the loop would let the compiler vectorize some of the work.
Do not get into the habit of doing things like that.
The cases where the optimization you make in (2):
is safe to do,
makes a noticeable difference, and
is something your compiler cannot figure out on its own
are few and far between.
If it were just the latter two points, I would just advise that you're worrying about something unimportant. However, that first point is the real killer: you do not want to get in the habit of giving yourself extra chances to make mistakes. It's far, far easier to accelerate slow, correct code than it is to debug fast, buggy code.
Now, that said, I'll try answering your question. The definitions of the functions SomeNotConstFunction and ConstFunction are (presumably) in the same translation unit. So if these functions really do not modify the vector, the compiler can figure it out, and it will only "call" size once.
However, the compiler does not have access to the definition of SomeExternalFunction, and so must assume that every call to that function has the potential of modifying your vector. The presence of that function in your loop guarantees that size is "called" every time.
I put "called" in quotes, however, because it is such a trivial function that it almost certainly gets inlined. Also, the function is ridiculously cheap -- two memory lookups (both nearly guaranteed to be cache hits), and either a subtraction and a right shift, or maybe even a specialized single instruction that does both.
Even if SomeExternalFunction does absolutely nothing, it's quite possible that "calling" size every time would still only be a small-to-negligible fraction of the running time of your loop.
Edit: In response to the edit....
what exactly is the time cost incurred by size() when the array really might change
The difference in the times you see when you time the two different versions of code. If you're doing very low level optimizations like that, you can't get answers through "pure reason" -- you must empirically test the results.
And if you really are doing such low level optimizations (and you can guarantee that the vector won't resize), you should probably be more worried about the fact the compiler doesn't know the base pointer of the array is constant, rather than it not knowing the size is constant.
If SomeExternalFunction really is external to the compilation unit, then you have pretty much no chance of the compiler vectorizing the loop, no matter what you do. (I suppose it might be possible at link time, though....) And it's also unlikely to be "trivial" because it requires function call overhead -- at least if "trivial" means the same thing to you as to me. (again, I don't know how good link time optimizations are....)
If you really can guarantee that some operations will not resize the vector, you might consider refining your class's API (or at least its protected or private parts) to include functions that self-evidently won't resize the vector.
The size method will typically be inlined by the compiler, so there will be a minimal performance hit, though there will usually be some.
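On the other hand, this is typically only true for vectors. If you are using a std::list, for instance, the size method could be quite expensive before C++11, when it was allowed to be linear in the number of elements; since C++11 it is required to run in constant time.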
If you are concerned with performance, you should get in the habit of using iterators and/or algorithms like std::for_each, rather than a size-based for loop.
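
A short sketch of the alternatives mentioned above, using a hypothetical summing task. The iterator-based forms read the end of the range once rather than re-evaluating size() on each iteration; whether that is measurably faster has to be checked by timing.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Size-based loop: size() is evaluated (at least notionally) every iteration.
double sum_indexed(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t k = 0; k < v.size(); ++k) s += v[k];
    return s;
}

// Algorithm-based loop: begin() and end() are evaluated once, up front.
double sum_for_each(const std::vector<double>& v) {
    double s = 0.0;
    std::for_each(v.begin(), v.end(), [&s](double x) { s += x; });
    return s;
}

// Range-based for: also iterator-based under the hood.
double sum_range_for(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}
```

The iterator forms assume the vector is not resized inside the loop, since that would invalidate the iterators.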
The micro optimization remarks are probably because the two most common implementations of vector::size() are
return _Size;
and
return _End - _Begin;
Hoisting them out of the loop will probably not noticeably improve the performance.
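
To show why the second form is so cheap, here is a hypothetical, non-owning stand-in for a vector (TinyView is not a real library type) whose size() is a pointer subtraction:

```cpp
#include <cstddef>

// Hypothetical minimal view over an array; it mimics the
// "return _End - _Begin;" style of size() and owns no memory.
template <class T>
class TinyView {
public:
    TinyView(T* first, T* last) : begin_(first), end_(last) {}

    // Two loads and a pointer subtraction (which includes the
    // division by sizeof(T), usually done as a shift).
    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }

    T& operator[](std::size_t i) const { return begin_[i]; }

private:
    T* begin_;
    T* end_;
};

// Size-based loop over the view; with size() inlined, the loop condition
// is a simple pointer-difference comparison.
long long sum_view(const TinyView<int>& v) {
    long long s = 0;
    for (std::size_t k = 0; k < v.size(); ++k) s += v[k];
    return s;
}
```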
And if it is obvious to everyone that it can be done, the compiler is also likely to notice. With modern compilers, and if SomeExternalFunction is statically linked, the compiler is usually able to see if the call might affect the vector's size.
Trust your compiler!
In MSVC 2015, it does a return (this->_Mylast() - this->_Myfirst()). I can't tell you offhand just how the optimizer might deal with this, but unless your array is const, the optimizer must allow for the possibility that you may modify its number of elements, which makes the call hard to optimize out. In Qt, for a QVector, it equates to an inline function that does a return d->size;.
I've taken to doing it in one particular project I'm working on, but it is for performance-oriented code. Unless you are interested in deeply optimizing something, I wouldn't bother. It probably is pretty fast any of these ways. In Qt, it is at most one pointer dereferencing, and is more typing. It looks like it could make a difference in MSVC.
I think nobody has offered a definitive answer so far; but if you really want to test it, have the compiler emit assembly source code, and inspect it both ways. I wouldn't be surprised to find that there's no difference when highly optimized. Let's not forget, though, that unoptimized performance during debug is also a factor that might be taken into consideration, when a lot of e.g. number crunching is involved.
I think the OP's original question really should have shown how the array is declared.