Performance function call vs multiplication by 1

Look at this function:
float process(float in) {
float out = in;
for (int i = 0; i < 31; ++i) {
if (biquads_[i]) {
out = biquads_[i]->filter(out);
}
}
return out;
}
biquads_ is a std::optional<Biquad>[31].
In this case I check every optional to see whether it holds a value and, if so, call the Biquad's filter function. If I instead called filter unconditionally and made the unused filters multiply by 1 or simply return the input value, would that be more efficient?

Most likely it won't make a shred of difference (I'm guessing somewhat, though, since your question is not entirely clear), for two reasons: 1) Unless the code is used in a very hot path, it won't matter even if one way is a few nanoseconds faster than the other. 2) Most likely your compiler's optimizer will be clever enough to generate code that performs nearly (if not exactly) the same in both cases. Did you test it? Did you benchmark or profile it? If not, do so, with optimization enabled.
Strive to write clear, readable, maintainable code. Worry about micro-optimization later when you actually have a problem and your profiler points to your function as a hot-spot.
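A minimal sketch of how the two variants could be compared, assuming a hypothetical stateless Biquad stand-in (a plain gain stage, not a real biquad filter) and a hypothetical timing helper timeNs. It only shows the shape of such a test; real numbers depend on the compiler, flags and CPU, so it has to be built with optimization enabled and measured on the target machine.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <optional>

// Hypothetical stand-in for the asker's Biquad: a stateless gain stage.
struct Biquad {
    float gain = 1.0f;
    float filter(float in) const { return in * gain; }
};

constexpr std::size_t kStages = 31;

// Variant A: empty slots are skipped with a branch.
float processChecked(const std::array<std::optional<Biquad>, kStages>& biquads, float in) {
    float out = in;
    for (const auto& b : biquads) {
        if (b) out = b->filter(out);
    }
    return out;
}

// Variant B: every slot holds a filter; unused slots are pass-through (gain 1).
float processUnconditional(const std::array<Biquad, kStages>& biquads, float in) {
    float out = in;
    for (const auto& b : biquads) out = b.filter(out);
    return out;
}

// Times `calls` invocations of f and returns the elapsed nanoseconds.
// The accumulated result is written to `sink` so the work is not optimised away.
template <class F>
long long timeNs(F f, int calls, float& sink) {
    assert(calls > 0);
    const auto start = std::chrono::steady_clock::now();
    float acc = 0.0f;
    for (int i = 0; i < calls; ++i) acc += f(static_cast<float>(i));
    const auto stop = std::chrono::steady_clock::now();
    sink = acc;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling timeNs once with a lambda around processChecked and once around processUnconditional, on the same input range, gives two numbers to compare. A single run is noisy, so repeat it several times before drawing conclusions.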

Related

Can I replace an if-statement with AND?

My prof once said, that if-statements are rather slow and should be avoided as much as possible. I'm making a game in OpenGL, where I need a lot of them.
In my tests replacing an if-statement with AND via short-circuiting worked, but is it faster?
bool doSomething();
int main()
{
int randomNumber = std::rand() % 10;
randomNumber == 5 && doSomething();
return 0;
}
bool doSomething()
{
std::cout << "function executed" << std::endl;
return true;
}
My intention is to use this inside the draw function of my renderer. My models are supposed to have flags, if a flag is true, a certain function should execute.

if-statements are rather slow and should be avoided as much as possible.
This is wrong and/or misleading. Most simplified statements about slowness of a program are wrong. There's probably something wrong with this answer too.
C++ statements don't have a speed that can be attributed to them. It's the speed of the compiled program that matters. And that consists of assembly language instructions; not of C++ statements.
What would probably be more correct is to say that branch instructions can be relatively slow (on modern, superscalar CPU architectures) (when the branch cannot be predicted well) (depending on what you are comparing to; there are many things that are much more expensive).
randomNumber == 5 && doSomething();
An if-statement is often compiled into a program that uses a branch instruction. A short-circuiting logical-and operation is also often compiled into a program that uses a branch instruction. Replacing an if-statement with a logical-and operator is not a magic bullet that makes the program faster.
If you were to compare the program produced by the logical-and and the corresponding program where it is replaced with if (randomNumber == 5), you would find that the optimiser sees through your trick and produces the same assembly in both cases.
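To make the comparison concrete, here is a small sketch with both forms side by side. The counter g_calls and the helper compareForms are hypothetical additions used only to observe behaviour; to compare the generated assembly, compile viaAnd and viaIf with optimization enabled and inspect the output of your own compiler.

```cpp
#include <cassert>

// Hypothetical counter so the effect of each call can be observed.
static int g_calls = 0;

bool doSomething() {
    ++g_calls;
    return true;
}

// Short-circuit form from the question.
void viaAnd(int n) {
    n == 5 && doSomething();
}

// Plain if-statement form.
void viaIf(int n) {
    if (n == 5) doSomething();
}

struct Counts {
    int andCalls;
    int ifCalls;
};

// Runs both forms over 0..9 and reports how often each one called doSomething().
Counts compareForms() {
    g_calls = 0;
    for (int n = 0; n < 10; ++n) viaAnd(n);
    const int andCalls = g_calls;

    g_calls = 0;
    for (int n = 0; n < 10; ++n) viaIf(n);
    const int ifCalls = g_calls;

    assert(andCalls == ifCalls);  // both forms have the same observable behaviour
    return {andCalls, ifCalls};
}
```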
My models are supposed to have flags, if a flag is true, a certain function should execute.
In order to avoid the branch, you must change the premise. Instead of iterating through a sequence of all models, checking the flag, and conditionally calling a function, you could create a sequence of all models for which the function should be called, iterate over that, and call the function unconditionally -> no branching. Is this alternative faster? There is certainly some overhead in maintaining the data structure, and the branch predictor may have made this unnecessary. The only way to know for sure is to measure the program.
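A rough sketch of that restructuring, assuming a hypothetical Model type with a flagged member and a hypothetical function special. The index list is rebuilt only when flags change, so the per-frame loop has no flag test left in it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model type; `flagged` and `updates` are illustrative names.
struct Model {
    bool flagged = false;
    int updates = 0;
};

// Hypothetical per-model work.
void special(Model& m) { ++m.updates; }

// Original shape: test the flag for every model inside the loop.
void drawChecked(std::vector<Model>& models) {
    for (Model& m : models) {
        if (m.flagged) special(m);
    }
}

// Build the list of flagged models once (or whenever the flags change)...
std::vector<std::size_t> collectFlagged(const std::vector<Model>& models) {
    std::vector<std::size_t> active;
    for (std::size_t i = 0; i < models.size(); ++i) {
        if (models[i].flagged) active.push_back(i);
    }
    return active;
}

// ...then the hot loop calls the function unconditionally.
void drawActive(std::vector<Model>& models, const std::vector<std::size_t>& active) {
    for (std::size_t i : active) {
        assert(i < models.size());
        special(models[i]);
    }
}
```

Note that the branch has not vanished entirely: it moved into collectFlagged, which pays off only if the flags change much less often than the draw loop runs.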

I agree with the comments above that in almost all practical cases, it's OK to use ifs as much as you need without hesitation.
I also agree that this is not an issue a beginner should spend energy optimizing, and that using logical operators will likely emit code similar to ifs.
However - there is a valid issue here related to branching in general, so those who are interested are welcome to read on.
Modern CPUs use what we call Instruction pipelining.
Without getting too deep into the technical details:
Within each CPU core there is a level of parallelism.
Each assembly instruction is composed of several stages, and while the current instruction is executed, the next instructions are prepared to a certain degree.
This is called instruction pipelining.
This concept is broken with any kind of branching in general, and conditionals (ifs) in particular.
It's true that there is a mechanism of branch prediction, but it works only to some extent.
So although in most cases ifs are totally OK, there are cases where the cost should be taken into account.
As always when it comes to optimizations, one should carefully profile.
Take the following piece of code as an example (similar things are common in image processing and other implementations):
unsigned char * pData = ...; // get data from somewhere
int dataSize = 100000000; // something big
bool cond = ...; // initialize some condition relevant for all data
for (int i = 0; i < dataSize; ++i, ++pData)
{
if (cond)
{
*pData = 2; // imagine some small calculation
}
else
{
*pData = 3; // imagine some other small calculation
}
}
It might be better to do it like this (even though it contains duplication which is evil from software engineering point of view):
if (cond)
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 2; // imagine some small calculation
}
}
else
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 3; // imagine some other small calculation
}
}
We still have an if, but it potentially causes a branch only once.
In certain [rare] cases (requires profiling as mentioned above) it will be more efficient to do even something like this:
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = (2 * cond + 3 * (!cond));
}
I know it's not common, but I encountered specific hardware some years ago on which the cost of 2 multiplications and 1 addition with negation was less than the cost of branching (due to the reset of the instruction pipeline). Also, this "trick" supports using different condition values for different parts of the data.
Bottom line: ifs are usually OK, but it's good to be aware that sometimes there is a cost.
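As an illustration of the last point, here is a sketch of the arithmetic form with a different condition per element. The function names and the use of std::vector are assumptions made for the example. An optimising compiler may well turn the branching form into branch-free code on its own, so this is something to profile rather than assume.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branching form: one conditional per element.
void fillBranching(std::vector<unsigned char>& data, const std::vector<bool>& cond) {
    assert(data.size() == cond.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (cond[i]) {
            data[i] = 2;  // imagine some small calculation
        } else {
            data[i] = 3;  // imagine some other small calculation
        }
    }
}

// Arithmetic form: a bool converts to 0 or 1, so exactly one term is non-zero.
void fillArithmetic(std::vector<unsigned char>& data, const std::vector<bool>& cond) {
    assert(data.size() == cond.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        const bool c = cond[i];
        data[i] = static_cast<unsigned char>(2 * c + 3 * !c);
    }
}
```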

Potentially Inefficient For Loop C++

I noticed the below and think it is inefficient. What am I missing? I imagine there must be speed advantages I am unaware of. For context, this is production code in a brokerage firm's API.
What I saw:
const unsigned MAX_ATTEMPTS = 50;
unsigned attempt = 0;
for (;;) {
++attempt;
// logic, functions, output
if( attempt >= MAX_ATTEMPTS) {
break;
}
}
What I expected:
const unsigned MAX_ATTEMPTS = 50;
for(unsigned attempt = 0; attempt < MAX_ATTEMPTS; ++attempt){
// logic, functions, output
}

I noticed the below and think it is inefficient. What am I missing?
That it's pointless to speculate about efficiency unless you know:
0) whether there's a measurable problem
1) how long it currently takes
2) how long it's desirable for it to take
3) how much effort it would cost to improve
So, if this loop is not speed-critical and is dominated by the logic, functions, output - which for the avoidance of doubt it absolutely is unless they have output orders of magnitude more efficient than anyone else - then there is no problem in the first place, and your speculation is unlikely to be productive.
If this loop is somehow speed critical (I emphasize again how unlikely this is), then you need to measure it - and you need to decide what result would be acceptable. Otherwise you're just wasting time rearranging deckchairs instead of doing anything valuable.
Finally, if you pass tests zero through two inclusive, you still need to judge whether any improvement is worth the effort required to implement, test, review and deploy it. If it turns out to be 1% below the optimum latency decided at step 2, and some other part of your codebase is currently burning money, then this is still not likely to be top priority.
From a learning rather than a business point of view however - it's great to spot potential inefficiencies like these. That's not because they're important to fix, but because you're probably wrong, and the process of learning how to benchmark them - and of understanding why you were wrong - is good experience and will improve your intuition for next time.

The only differences are that in the original code:
You can access the last value of attempt after the loop
The loop will be executed at least once.
It offers no obvious benefits. And if you ask me, the original code is quite ugly. I would have done this instead:
unsigned attempt = 1;
do {
// Logic
} while(++attempt <= MAX_ATTEMPTS);
There is a chance that one of them gets compiled to faster code. In order to find out, you need to benchmark it. Which one is faster (if any) can vary from system to system.
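The behavioural difference between the three loop shapes can be checked by counting how often the body runs. In this sketch the function names and the maxAttempts parameter are assumptions added so that the edge case of a zero limit can be tried as well.

```cpp
#include <cassert>

// Shape seen in the production code: increment first, test at the bottom.
unsigned runsInfiniteLoopForm(unsigned maxAttempts) {
    unsigned bodyRuns = 0;
    unsigned attempt = 0;
    for (;;) {
        ++attempt;
        ++bodyRuns;  // stands in for "logic, functions, output"
        if (attempt >= maxAttempts) {
            break;
        }
    }
    return bodyRuns;
}

// Shape the asker expected: test at the top.
unsigned runsForForm(unsigned maxAttempts) {
    unsigned bodyRuns = 0;
    for (unsigned attempt = 0; attempt < maxAttempts; ++attempt) {
        ++bodyRuns;
    }
    return bodyRuns;
}

// do-while shape from this answer.
unsigned runsDoWhileForm(unsigned maxAttempts) {
    unsigned bodyRuns = 0;
    unsigned attempt = 1;
    do {
        ++bodyRuns;
    } while (++attempt <= maxAttempts);
    return bodyRuns;
}

// All three agree for a limit of 50; only the for form runs zero times for a limit of 0.
void checkLoopForms() {
    assert(runsInfiniteLoopForm(50) == 50);
    assert(runsForForm(50) == 50);
    assert(runsDoWhileForm(50) == 50);
    assert(runsInfiniteLoopForm(0) == 1);
    assert(runsForForm(0) == 0);
    assert(runsDoWhileForm(0) == 1);
}
```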

I think that you cut too much, and I suppose that it was something like this.
for (;;) {
++attempt;
// logic, functions, output
result = somefunc();
if(result == SUCCESS) break;
if( attempt >= MAX_ATTEMPTS) {
break;
}
}
I do not like zillions of breaks in the code. I prefer:
do {
// Logic
result = somefunc();
} while(result != SUCCESS && ++attempt < MAX_ATTEMPTS);
If I am right your version should look like this
result = FAILURE;
for(unsigned attempt = 1; result != SUCCESS && attempt <= MAX_ATTEMPTS; ++attempt){
result = somefunc();
// logic, functions, output
}
There will not be any difference in performance between those versions. It is a question of personal preference.

Calculations inside the `for (...)` statement

A lot of times I see code like:
int s = a / x;
for (int i = 0; i < s; i++)
// do something
If inside the for loop, neither a nor x is modified, can I then simply write:
for (int i = 0; i < a / x; i++)
// do something
and then assume that the compiler optimizes a/x, i.e replaces it with a constant?

The most important part of int s = a / x is the variable name. It gives your syntax semantics, and lets you remember 12 months later why you were dividing one thing by another. You can't name the expression in the for statement, so you lose that self-documenting nature.
const auto monthlyAmount = (int)yearlyAmount / numberOfMonths;
for (auto i = 0; i < monthlyAmount; ++i)
// do something
In this way, extracting the variable isn't for a compiler optimization, it's a human maintainability optimization.

If the compiler can be sure that the variables used in the expression in the middle of your for loop will not change between iterations, it can optimize the calculation to be performed once at the beginning of the loop, instead of every iteration.
However, consider that the variables used are global variables, or references to variables external to the function, and in your for loop you call a function. The function could change these variables. If the compiler is able to see enough of the code at that point, it could find out if this is the case to decide whether to optimize. However, compilers are only willing to look so far (otherwise things would take even longer to compile), so in general you cannot assume the optimization is performed.
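A small sketch of why the compiler cannot always hoist the division: if the loop body can change the variables, the two ways of writing the loop are not even equivalent. The globals g_a and g_x and the function shrink are hypothetical names chosen for the illustration.

```cpp
#include <cassert>

// Globals: any function called from the loop body may modify them.
int g_a = 100;
int g_x = 10;

// Hypothetical loop body that changes a variable used in the loop bound.
void shrink() { g_a -= 10; }

// Bound written in the for statement: re-evaluated before every iteration.
int iterationsInlineBound() {
    g_a = 100;
    g_x = 10;
    assert(g_x != 0);
    int n = 0;
    for (int i = 0; i < g_a / g_x; ++i) {
        shrink();
        ++n;
    }
    return n;  // 5: the bound drops by one on each pass
}

// Bound computed once before the loop.
int iterationsHoistedBound() {
    g_a = 100;
    g_x = 10;
    assert(g_x != 0);
    const int s = g_a / g_x;
    int n = 0;
    for (int i = 0; i < s; ++i) {
        shrink();
        ++n;
    }
    return n;  // 10: the bound was fixed at the start
}
```

If shrink were defined in another translation unit, the compiler would have to assume it might do exactly this, and would keep re-evaluating g_a / g_x in the first form.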

The concern for optimization probably stems from the fact that the condition is evaluated before each iteration. If this is a potentially expensive operation and you don't need to do it over and over again, you can extract it out of the loop:
const std::size_t size = s.size(); // just an example
for (std::size_t i = 0; i < size; ++i)
{
}
For inexpensive operations this is probably a premature optimization and the compiler might generate the same code. The only way to be sure is to check the generated assembly code.

The problem with such questions is that they cannot be generalized. Which optimizations the compiler will and will not perform can only be determined by case-by-case analysis.
I'd certainly expect the compiler to do this if one of the following holds true:
1) Both a and x are local variables whose addresses are never taken.
2) The code in the loop is completely inlined.
In practice the last requirement isn't as hard as it looks, because if the functions in the body cannot be inlined, their runtime will likely dwarf the time to re-compute the bound anyway.

Why is it not cost effective to inline functions with loops or switch statements?

I noticed that Google's C++ style guide cautions against inlining functions with loops or switch statements:
Another useful rule of thumb: it's typically not cost effective to
inline functions with loops or switch statements (unless, in the
common case, the loop or switch statement is never executed).
Other comments on StackOverflow have reiterated this sentiment.
Why are functions with loops or switch statements (or gotos) not suitable for or compatible with inlining? Does this apply to functions that contain any type of jump? Does it apply to functions with if statements? Also (and this might be somewhat unrelated), why is inlining functions that return a value discouraged?
I am particularly interested in this question because I am working with a segment of performance-sensitive code. I noticed that after inlining a function that contains a series of if statements, performance degrades pretty significantly. I'm using GNU Make 3.81, if that's relevant.

Inlining functions with conditional branches makes it more difficult for the CPU to accurately predict the branch statements, since each instance of the branch is independent.
If there are several branch statements, successful branch prediction saves a lot more cycles than the cost of calling the function.
Similar logic applies to unrolling loops with switch statements.
The Google guide referenced doesn't mention anything about functions returning values, so I'm assuming that reference is elsewhere, and requires a different question with an explicit citation.

While in your case, the performance degradation seems to be caused by branch mispredictions, I don't think that's the reason why the Google style guide advocates against inline functions containing loops or switch statements. There are use cases where the branch predictor can benefit from inlining.
A loop is often executed hundreds of times, so the execution time of the loop is much larger than the time saved by inlining. So the performance benefit is negligible (see Amdahl's law). OTOH, inlining functions results in increase of code size which has negative effects on the instruction cache.
In the case of switch statements, I can only guess. The rationale might be that jump tables can be rather large, wasting much more memory in the code segment than is obvious.
I think the keyword here is cost effective. Functions that cost a lot of cycles or memory are typically not worth inlining.

The purpose of a coding style guide is to tell you that if you are reading it you are unlikely to have added an optimisation to a real compiler, even less likely to have added a useful optimisation (measured by other people on realistic programs over a range of CPUs), therefore quite unlikely to be able to out-guess the guys who did. At least, do not mislead them, for example, by putting the volatile keyword in front of all your variables.
Inlining decisions in a compiler have very little to do with 'Making a Simple Branch Predictor Happy'. Or less confused.
First off, the target CPU may not even have branch prediction.
Second, a concrete example:
Imagine a compiler which has no other optimisation (turned on) except inlining. Then the only positive effect of inlining a function is that the bookkeeping related to function calls (saving registers, setting up locals, saving the return address, and jumping to and back) is eliminated. The cost is duplicating code at every single location where the function is called.
In a real compiler dozens of other simple optimisations are done and the hope of inlining decisions is that those optimisations will interact (or cascade) nicely. Here is a very simple example:
int f(int s)
{
...;
switch (s) {
case 1: ...; break;
case 2: ...; break;
case 42: ...; return ...;
}
return ...;
}
void g(...)
{
int x=f(42);
...
}
When the compiler decides to inline f, it replaces the RHS of the assignment with the body of f. It substitutes the actual parameter 42 for the formal parameter s and suddenly it finds that the switch is on a constant value...so it drops all the other branches and hopefully the known value will allow further simplifications (ie they cascade).
If you are really lucky, all calls to the function will be inlined and (unless f is visible outside) the original f will completely disappear from your code. So your compiler eliminated all the bookkeeping and made your code smaller at compile time. And made the code more local at runtime.
If you are unlucky, the code size grows, locality at runtime decreases and your code runs slower.
It is trickier to give a nice example when it is beneficial to inline loops because one has to assume other optimisations and the interactions between them.
The point is that it is hellishly difficult to predict what happens to a chunk of code even if you know all the ways the compiler is allowed to change it. I can't remember who said it but one should not be able to recognise the executable code produced by an optimising compiler.
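A concrete version of the switch example, with made-up case bodies (10, 20, 4200 and -1 are placeholders, not values from the answer). The function is marked constexpr only so that static_assert can show the call is computable at compile time; whether an optimiser folds an ordinary non-constexpr f in the same way after inlining is its own decision.

```cpp
#include <cassert>

// Placeholder bodies: each case just returns a distinct constant.
constexpr int f(int s) {
    switch (s) {
        case 1: return 10;
        case 2: return 20;
        case 42: return 4200;
    }
    return -1;
}

// With s known to be 42, every other case is dead code and the call
// collapses to a single constant.
static_assert(f(42) == 4200, "f(42) is a compile-time constant");

int g() {
    const int x = f(42);
    assert(x == 4200);
    return x;
}
```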

I think it might be worth extending the example provided by @user1666959. I'm posting this as an answer so that I can provide cleaner example code.
Let's consider this scenario.
/// Counts odd numbers in range [0;number]
size_t countOdd(size_t number)
{
size_t result = 0;
for (size_t i = 0; i <= number; ++i)
{
result += (i % 2);
}
return result;
}
int main()
{
return countOdd(5);
}
If the function is not inlined and has external linkage, the whole loop is executed at run time. Imagine what happens when you inline it.
int main()
{
size_t result = 0;
for (size_t i = 0; i <= 5; ++i)
{
result += (i % 2);
}
return result;
}
Now let's enable the loop unrolling optimization. Here we know that the loop iterates from 0 to 5, so it can easily be unrolled, removing the loop conditions from the code.
int main()
{
size_t result = 0;
// iteration 0
size_t i = 0;
result += (i % 2);
// iteration 1
++i;
result += (i % 2);
// iteration 2
++i;
result += (i % 2);
// iteration 3
++i;
result += (i % 2);
// iteration 4
++i;
result += (i % 2);
// iteration 5
++i;
result += (i % 2);
return result;
}
No conditions, so it is faster already, but that's not all. We know the value of i at each step, so why not substitute it directly?
int main()
{
size_t result = 0;
// iteration 0
result += (0 % 2);
// iteration 1
result += (1 % 2);
// iteration 2
result += (2 % 2);
// iteration 3
result += (3 % 2);
// iteration 4
result += (4 % 2);
// iteration 5
result += (5 % 2);
return result;
}
Even simpler, but wait: those operations are constant expressions, so we can calculate them during compilation.
int main()
{
size_t result = 0;
// iteration 0
result += 0;
// iteration 1
result += 1;
// iteration 2
result += 0;
// iteration 3
result += 1;
// iteration 4
result += 0;
// iteration 5
result += 1;
return result;
}
So now the compiler sees that some of those operations have no effect and keeps only the ones that change the value. After that it removes unnecessary temporary variables and performs as many calculations as it can during compilation, so your code ends up as:
int main()
{
return 3;
}
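The end result of that chain can be checked directly. In this sketch countOdd is marked constexpr (an addition, not part of the original example) so that static_assert can confirm the value at compile time; an ordinary optimiser is free, but not obliged, to reach the same result for the non-constexpr version.

```cpp
#include <cassert>
#include <cstddef>

/// Counts odd numbers in range [0;number]
constexpr std::size_t countOdd(std::size_t number) {
    std::size_t result = 0;
    for (std::size_t i = 0; i <= number; ++i) {
        result += (i % 2);
    }
    return result;
}

// Evaluated entirely by the compiler: 1, 3 and 5 are the odd numbers in [0;5].
static_assert(countOdd(5) == 3, "countOdd(5) folds to 3");

// The same function still works for values known only at run time.
std::size_t countOddChecked(std::size_t number) {
    const std::size_t result = countOdd(number);
    assert(result <= number);  // there can be no more odd numbers than numbers
    return result;
}
```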

Can't recursive functions be inlined? [duplicate]

Possible Duplicate:
Can a recursive function be inline?
What are the trade-offs of making recursive functions inline?

Recursive functions that can be optimised by tail-end recursion can certainly be inlined. If the last thing a function does is call itself, then it can be converted into a plain loop.
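A sketch of that conversion, using a hypothetical accumulator-style function sumTo. The loop version is written by hand here to show what tail-call elimination effectively produces; whether a given compiler performs the transformation depends on the compiler and its optimization settings.

```cpp
#include <cassert>

// Tail-recursive: the recursive call is the last action and its result
// is returned unchanged.
unsigned long long sumTo(unsigned n, unsigned long long acc = 0) {
    if (n == 0) return acc;
    return sumTo(n - 1, acc + n);
}

// The equivalent plain loop: parameters become variables that are updated
// in place, and the call becomes a jump back to the top.
unsigned long long sumToLoop(unsigned n) {
    unsigned long long acc = 0;
    while (n != 0) {
        acc += n;
        --n;
    }
    return acc;
}

void checkSumTo() {
    assert(sumTo(100) == 5050);
    assert(sumToLoop(100) == 5050);
}
```

Once the function is a plain loop, it can be inlined at call sites like any other non-recursive function.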

Arbitrary recursive functions can't be inlined for the same reason a snake can't swallow its own tail.

[Edit: just noticed that although your title says "be inlined", your actual question says "making functions inline". The two effectively have nothing to do with one another, they just have confusingly similar names. In modern compilers, the primary effect of inline is the thing that originally in C99 was (I think) just a necessary detail to make inline work at all: to permit multiple definitions of a symbol with external linkage. That's because modern compilers don't pay a whole lot of attention to the programmer's opinion of whether a function should be inlined. They do pay some, though, so the confusion of concepts persists. I've answered the question in the title, which is the decision the compiler makes, not the question in the body, which is the decision the programmer makes.]
Inlining is not necessarily an all-or-nothing deal. One strategy which compilers use to decide whether to inline, is to keep inlining function calls until the resulting code is "too big". "Big" is defined by some hopefully sensible heuristic.
So consider the following recursive function (which deliberately is not simply tail-recursive):
int triangle(int n) {
if (n == 1) return 1;
return n + triangle(n-1);
}
If it's called like this:
int t100() {
return triangle(100);
}
Then there's no particular reason in principle that the usual rules that the compiler uses for inlining shouldn't result in this:
int t100() {
// inline call to triangle(100)
int result;
if (100 == 1) { result = 1; } else {
// inline call to triangle(99)
int t99;
if (100 - 1 == 1) { t99 = 1; } else {
// inline call to triangle(98)
int t98;
if (100 - 1 - 1 == 1) { t98 = 1; } else {
// oops, "too big", no more inlining
t98 = triangle(100 - 1 - 1 - 1) + 98;
}
t99 = t98 + 99;
}
result = t99 + 100;
}
return result;
}
Obviously the optimiser will have a field day with that, so it's much "smaller" than it looks:
int t100() {
return triangle(97) + 297;
}
The code in triangle itself could be "unrolled" a few steps by a few levels of inlining, in exactly the same way, except that it doesn't have the benefits of constants:
int triangle(int n) {
if (n == 1) return 1;
if (n == 2) return 3;
if (n == 3) return 6;
return triangle(n-3) + 3*n - 3;
}
I doubt whether compilers actually do this, though, I don't think I've ever noticed it [Edit: MSVC does if you tell it to, thanks peterchen].
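The two simplified forms above can be checked against the original definition. In this sketch t100Folded and triangleUnrolled are names introduced for the comparison, and the hand-written bodies only mimic what a compiler might produce.

```cpp
#include <cassert>

// Original definition (valid for n >= 1).
int triangle(int n) {
    if (n == 1) return 1;
    return n + triangle(n - 1);
}

// Three levels inlined at the call site, then constant-folded: 98 + 99 + 100 = 297.
int t100Folded() {
    return triangle(97) + 297;
}

// Three levels inlined into the function itself: n + (n-1) + (n-2) = 3*n - 3.
int triangleUnrolled(int n) {
    if (n == 1) return 1;
    if (n == 2) return 3;
    if (n == 3) return 6;
    return triangleUnrolled(n - 3) + 3 * n - 3;
}

void checkTriangle() {
    assert(t100Folded() == triangle(100));
    for (int n = 1; n <= 50; ++n) {
        assert(triangleUnrolled(n) == triangle(n));
    }
}
```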
There's an obvious potential benefit in saving call overhead, but as against that people don't really expect recursive functions to get inlined, and there's no particular guarantee that the usual inlining heuristics will perform well with recursive functions (where there are two different places, the call site and the recursive call, that might be inlined, with different benefits in each case). Furthermore, it's difficult at compile time to estimate how deep the recursion will go, and the inline heuristics might like to take account of the call depth to make decisions. So it may be that the compiler just doesn't bother.
Functional language compilers are typically a lot more aggressive dealing with recursion than C or C++ compilers. The relevant trade-off there is that so many functions written in functional languages are recursive, that performance might be hopeless if the compiler couldn't optimise tail-recursion. So Lisp programmers typically rely on good optimisation of recursive functions, whereas C and C++ programmers typically don't.

If your compiler does not support it, you can try manually inlining instead...
int factorial(int n) {
int result = 1;
if (n-- == 0) {
return result;
} else {
result *= 1;
if (n-- == 0) {
return result;
} else {
result *= 2;
if (n-- == 0) {
return result;
} else {
result *= 3;
if (n-- == 0) {
return result;
} else {
result *= 4;
if (n-- == 0) {
return result;
} else {
// ...
}
}
}
}
}
}
See the problem yet?

Tail recursion (a special case of recursion) can be inlined by smart compilers.

Now, hold on. A tail-recursive function could be unrolled and inlined pretty easily. Apparently there are compilers that do this, but I am not aware of specifics.

Of course. Any function can be inlined if it makes sense to do it:
int f(int i)
{
if (i <= 0) return 1;
else return i * f(i - 1);
}
int main()
{
return f(10);
}
pseudo assembly (f is inlined in main):
main:
mov r0, #10 ; Pass 10 to f
f:
cmp r0, #0 ; arg <= 0? ...
bgt 1l
mov r0, #1 ; ... if so, return 1
ret
1:
mov r0, -(sp) ; if not, save arg.
dec r0 ; pass arg - 1 to f
call f ; just because it's inlined doesn't mean I can't call it.
mul r0, (sp)+ ; compute the result
ret ; done.
;-)

When you call an ordinary function, you change the sequential execution order of instructions and jump (call or jmp) to the address where the function resides. Inlining means that the function's instructions are placed at every occurrence of the call, so there is no single place to jump to. Other optimisations also become possible, such as eliminating the pushing and popping of function parameters.

When you know that the recursive chain will normally not be very long, you could inline up to a predefined level (I don't know whether any existing compiler is intelligent enough for this today).
Inlining a recursive function is much like unrolling a loop. You will end up with a lot of duplicate code, but in some cases it could be worthwhile:
The number of recursive calls (the length of the chain) is normally short (in cases where it gets longer than the predefined level, just do normal recursion).
The overhead of the function calls is relatively big compared to the logic, so do some "unrolling" of, for example, five instances and end with a recursive call again. This would save 80% of the call overhead.
Of course there is also the tail-recursive special case, but this was mentioned by others.
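The 80% figure can be illustrated by counting calls. In this sketch the functions sumPlain and sumUnrolled5, the counter g_callCount and the helper callsFor are all hypothetical; the five copies of the body are pasted by hand to imitate what five levels of inlining would leave behind.

```cpp
#include <cassert>

static int g_callCount = 0;

// Plain recursion: one call per step.
int sumPlain(int n) {
    ++g_callCount;
    if (n <= 0) return 0;
    return n + sumPlain(n - 1);
}

// Five steps handled per call, then one real recursive call.
int sumUnrolled5(int n) {
    ++g_callCount;
    if (n <= 0) return 0;
    int total = n--;
    if (n <= 0) return total;
    total += n--;
    if (n <= 0) return total;
    total += n--;
    if (n <= 0) return total;
    total += n--;
    if (n <= 0) return total;
    total += n--;
    return total + sumUnrolled5(n);
}

// Returns how many calls fn needs for the given n (expects n >= 0).
int callsFor(int (*fn)(int), int n) {
    g_callCount = 0;
    const int result = fn(n);
    assert(result == n * (n + 1) / 2);
    return g_callCount;
}
```

For n = 100 the plain version makes 101 calls and the unrolled version 21, which is roughly the 80% saving mentioned above. The price is five copies of the body in the generated code.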

Of course it can be declared inline. The inline keyword is just a hint to the compiler. In many cases the compiler just ignores it, and depending on the compiler this could be one of those situations.

Some compilers can turn tail recursion into plain loops, and thus inline them normally.
Non-tail recursion could be inlined up to a given depth, usually decided by the compiler.
I've never encountered a practical application for that, as the cost of call isn't high enough anymore to offset the increase in code size.
[edit] (To clarify that: even though I like to toy with these things, and often check what code my compiler generates for "funny stuff" just out of curiosity, I haven't encountered a use case where any such unrolling helped significantly. This doesn't mean they don't exist or couldn't be constructed.)
The only place where it would help is precalculating low iterations during compile time. However, in my experience this immensely increases compile times for often negligible runtime performance benefits.
Note that Visual Studio 2008 (and earlier) gives you quite some control over this:
#pragma inline_recursion(on)
#pragma inline_depth(N)
__forceinline
Be careful with the latter, it can easily overload the compiler :)

Inline means that at each place where a function marked as inline is called, the compiler places a copy of that function's code. This avoids the function calling mechanism and its usual pushing and popping of arguments on the stack, saving time in gazillion-calls-per-second situations. You see the consequences for static variables and stuff like that? All gone...
So, if you had an inlined recursive call, either your compiler is super smart and figures out whether the number of copies is deterministic, or it will say "Cannot make it inline", because it wouldn't know when to stop.
