Say you see a loop like this one:
for (int i = 0;
     i < thing.getParent().getObjectModel().getElements(SOME_TYPE).count();
     ++i)
{
    thing.getData().insert(
        thing.getData().count(),
        thing.getParent().getObjectModel().getElements(SOME_TYPE)[i].getName()
    );
}
If this were Java I'd probably not think twice. But in performance-critical sections of C++, it makes me want to tinker with it... however, I don't know if the compiler is smart enough to make that futile.
This is a made-up example, but all it's doing is inserting strings into a container. Please don't assume any of these are STL types; think in general terms about the following:
Is having a messy condition in the for loop going to get evaluated each time, or only once?
If those get methods are simply returning references to member variables on the objects, will they be inlined away?
Would you expect custom [] operators to get optimized at all?
In other words is it worth the time (in performance only, not readability) to convert it to something like:
ElementContainer &source =
    thing.getParent().getObjectModel().getElements(SOME_TYPE);
int num = source.count();
Store &destination = thing.getData();
for (int i = 0; i < num; ++i)
{
    destination.insert(destination.count(), source[i].getName());
}
Remember, this is a tight loop, called millions of times a second. What I wonder is whether all this will shave a couple of cycles per iteration, or something more substantial.
Yes I know the quote about "premature optimisation". And I know that profiling is important. But this is a more general question about modern compilers, Visual Studio in particular.
The general way to answer such questions is to look at the produced assembly. With gcc, this involves replacing the -c flag with -S.
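For example, here is a self-contained sketch you can build with g++ -O2 -S and inspect (the class and function are invented for illustration):

// inspect.cpp -- compile with: g++ -O2 -S inspect.cpp -o inspect.s
// Then search inspect.s for a call to getCount(); with optimisation on
// there usually is none: the getter is inlined and the bound kept in a register.
struct Container {
    int n;
    int getCount() const { return n; } // trivial const getter
};

int sum(const Container &c) {
    int total = 0;
    for (int i = 0; i < c.getCount(); ++i) // getter call in the loop condition
        total += i;
    return total;
}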
My own rule is not to fight the compiler. If something is to be inlined, then I make sure that the compiler has all the information needed to perform such an inline, and (possibly) I try to urge it to do so with an explicit inline keyword.
Also, inlining saves a few opcodes but makes the code grow, which, as far as L1 cache is concerned, can be very bad for performance.
All the questions you are asking are compiler-specific, so the only sensible answer is "it depends". If it is important to you, you should (as always) look at the code the compiler is emitting and do some timing experiments. Make sure your code is compiled with all optimisations turned on - this can make a big difference for things like operator[](), which is often implemented as an inline function, but which won't be inlined (in GCC at least) unless you turn on optimisation.
If the loop is that critical, I can only suggest that you look at the code generated. If the compiler is allowed to aggressively optimise the calls away then perhaps it will not be an issue. Sorry to say this, but modern compilers can optimise incredibly well, and I really would suggest profiling to find the best solution in your particular case.
If the methods are small and can and will be inlined, then the compiler may do the same optimizations that you have done. So, look at the generated code and compare.
Edit: It is also important to mark const methods as const, e.g. in your example count() and getName() should be const to let the compiler know that these methods do not alter the contents of the given object.
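For example (a sketch; only the method names come from the question, the bodies are assumptions):

#include <string>

class Element {
public:
    // const tells the compiler (and readers) the call doesn't mutate *this
    const std::string &getName() const { return m_name; }
private:
    std::string m_name;
};

class ElementContainer {
public:
    int count() const { return m_count; } // likewise const
private:
    int m_count = 0;
};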
As a rule, you should not have all that garbage in your "for" condition unless the result is going to change during loop execution.
Use another variable, set outside the loop. This will eliminate the WTF when reading the code, it will not negatively impact performance, and it will sidestep the question of how well the functions get optimized. If those calls are not optimized, this will also result in a performance increase.
I think in this case you are asking the compiler to do more than it legitimately can given the scope of compile-time information it has access to. So, in particular cases the messy condition may be optimized away, but really, the compiler has no particularly good way to know what kind of side effects you might have from that long chain of function calls. I would assume that breaking out the test would be faster unless I have benchmarking (or disassembly) that shows otherwise.
This is one of the cases where the JIT compiler has a big advantage over a C++ compiler. It can in principle optimize for the most common case seen at runtime and provide optimized bytecode for that (plus checks to make sure that one falls into that case). This sort of thing is used all the time in polymorphic method calls that turn out not to actually be used polymorphically; whether it could catch something as complex as your example, though, I'm not certain.
For what it's worth, if speed really mattered, I'd split it up in Java too.
Related
For example, I have code like this:
void func(const QString& str)
{
    QString s = str;
    s.replace(QRegExp("[abc]+"), " ");
    ......
}
Will the compiler optimize the QRegExp("[abc]+") temporary, constructing it only once instead of on every invocation of func? Or, in other words, do I need to rewrite the code for performance, like this:
void func(const QString& str)
{
    static const QRegExp sc_re("[abc]+");
    QString s = str;
    s.replace(sc_re, " ");
    ......
}
making the QRegExp a static const variable?
Will the compiler optimize the QRegExp("[abc]+") temporary, constructing it only once instead of on every invocation of func?
You are assuming that each invocation of func will construct an identical QRegExp object, but how do you know that? How do you know, for example, that these objects do not contain a serial number, an integer member that is set to the number of QRegExp objects previously constructed? If such a serial number were being used, it would be wrong for the compiler to construct your temporary variable just once.
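To make that concrete, here is a contrived class with exactly such a serial number (an invented example, not how QRegExp actually works):

struct Tagged {
    static int s_constructed;             // objects constructed so far
    int serial;
    Tagged() : serial(s_constructed++) {} // every instance is observably distinct
};
int Tagged::s_constructed = 0;
// Two evaluations of Tagged() must really construct two objects, so a
// compiler could not legally hoist such a "temporary" out of the function.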
OK, we can reasonably guess that nothing like that is going on. The point, though, is that we are guessing, and the compiler is not allowed to guess. So a prerequisite for the compiler considering such an optimization would be that the definition of the constructor is available (which is an implementation detail of that class, something you should not make your code dependent on).
If the constructor's definition is available, and if that definition provably produces the same results given the same input (and probably some other technical restrictions that slip my mind at the moment), then a compiler would be allowed to make this optimization.
I do not know if any compilers choose to provide this sort of optimization when it would be both allowed and beneficial (another assumption you've made). Performance testing of the two candidates with and without optimizations enabled should reveal if your particular compiler is likely taking advantage of this.
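A rough harness for that test might look like this (a sketch; funcLocal and funcStatic stand for the two variants above):

#include <chrono>
#include <iostream>

template <typename F>
long long timeMs(F f, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f(); // run the variant under test
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

// usage, assuming the two variants of func exist:
//   std::cout << timeMs([] { funcLocal("aabbcc"); }, 1000000) << '\n';
//   std::cout << timeMs([] { funcStatic("aabbcc"); }, 1000000) << '\n';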
Or, in other words, do I need to rewrite the code for performance, like this:
You almost never need to re-implement for performance. (One exception would be if your code is so inefficient it would take centuries to finish. I'm pretty sure we're not in that ballpark.) A better question is "should". I'll go with that.
In this specific case I would guess "no, that looks like premature optimization". However, that is just a guess, so I'll proceed to general guidelines that you can apply.
You should re-implement for performance only if:
1) the performance gain is noticeable to an end user, or
2) the new code is easier for a programmer to read and understand.
In other cases, rely on the compiler to make appropriate optimizations.
In your case, I see the variable name sc_re and think "what is that?" So point 2 is out. That leaves the question of a noticeable performance gain. This usually is not something one can determine by simply asking around. Typically, it involves performance testing, probably of at least two types. One test would time the two candidates in an artificial heavy loop to see how large the performance gain is (if there is one at all). The other test would profile your actual program to see if this code is called often enough for the gain to be noticed by an end user. A good third test would be to give the actual program to an end user and see if they notice the difference.
Of these tests, profiling might be the most productive use of your time. (Programmers are notoriously bad at identifying true performance roadblocks without the aid of a profiler.) If you spend 2 milliseconds in this function every 5 minutes, why spend time trying to improve that? On the other hand, if you spend 1 second in this function each time it is called, the profiler might tell you whether or not this constructor is the main culprit.
I wrote this basic code for a DSP/audio application I'm making:
double input = 0.0;
for (int i = 0; i < nChannels; i++) {
    input = inputs[i];
    // ...
}
and a DSP engineering expert told me: "you should not declare it outside the loop, otherwise it creates a dependency and the compiler can't deal with it as efficiently as possible."
He's talking about the variable input, I think. Why is that? Isn't it better to declare it once and overwrite it?
Maybe it has something to do with a different memory location being used, i.e. a register instead of the stack?
Good old K&R C compilers in the early eighties used to produce code as near as possible to what the programmer wrote, and programmers used to do their best to produce optimized source code. Modern optimizing compilers can rework things provided the resulting code has the same observable effects as the original code. So here, assuming the input variable is not used outside the loop, an optimizing compiler could optimize out the line double input = 0.0; because there are no observable effects until the next assignment, input = inputs[i];. And it could likewise factor the variable assignment outside the loop (whether it is inside or not in the C++ source), for the same reason.
Short story: unless you want to produce code for one specific compiler with one specific parameter set (in which case you should thoroughly examine the generated assembly code), you should never worry about those low-level optimizations. Some people say the compiler is smarter than you; others say the compiler will produce its own code whatever way you write yours.
What matters is just readability and variable scoping. Here input is functionally local to the loop, so it should be declared inside the loop. Full stop. Any other optimization consideration is just useless, unless you have special requirements for low-level optimization (i.e., profiling shows that these lines need special processing).
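In other words (a sketch using the names from the question):

void process(const double *inputs, int nChannels) {
    for (int i = 0; i < nChannels; i++) {
        double input = inputs[i]; // declared in the smallest scope that needs it
        // ... per-channel DSP work on input ...
    }
}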
Many people think that declaring a variable allocates some memory for you to use. It does not work like that. It does not allocate a register either.
It only creates for you a name (and an associated type) that you can use to link consumers of values with their producers.
On a 50-year-old compiler (or one written by students in their 3rd-year Compiler Construction course), that may indeed be implemented by allocating some memory for the variable on the stack and using that every time the variable is referenced. It's simple, it works, and it's horribly inefficient. A good step up is putting local variables in registers when possible, but that uses registers inefficiently, and it's not where we're at currently (and haven't been for some time).
Linking consumers with producers creates a data flow graph. In most modern compilers, it's the edges in that graph that receive registers. This is completely removed from any variables as you declared them. They no longer exist. You can see this in action if you use -emit-llvm in clang.
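For example, compile a trivial function with clang++ -O1 -S -emit-llvm and look at the output (the file names are placeholders):

// clang++ -O1 -S -emit-llvm labels.cpp -o labels.ll
int scale(int a) {
    int x = a * 2; // 'x' and 'y' are just labels for edges in the
    int y = x + 1; // data-flow graph, not storage locations
    return y;
}
// The emitted IR keeps no trace of x or y; only SSA values (e.g. %mul
// and %add) remain, linking each producer to its consumer.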
So variables aren't real, they're just labels. Use them as you want.
It is better to declare the variable inside the loop, but the reason given is wrong.
There is a rule of thumb: declare variables in the smallest scope possible. Your code is more readable and less error prone this way.
As for the performance question, it doesn't matter at all for any modern compiler where exactly you declare your variables. For example, clang eliminates the variable entirely at -O1 in its own IR: https://godbolt.org/g/yjs4dA
One corner case, however: if you ever take the address of input, the variable can't be (easily) eliminated, and you should declare it inside the loop if you care about performance.
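A sketch of that corner case (consume is a hypothetical function defined in another translation unit):

void consume(double *p); // opaque to the compiler

void process(const double *inputs, int nChannels) {
    for (int i = 0; i < nChannels; i++) {
        double input = inputs[i];
        consume(&input); // the address escapes, so 'input' needs a real stack slot
    }
}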
Out of habit, when accessing values via . or ->, I assign them to variables any time the value is going to be used more than once. My understanding is that in scripting languages like ActionScript this is pretty important. However, in C/C++, I'm wondering if this is a meaningless chore; am I wasting effort that the compiler is going to handle for me, or am I exercising a good practice, and why?
struct Foo
{
    Foo(int val) { m_intVal = val; }
    int GetInt() { return m_intVal; }
    int m_intVal; // not private for the sake of the last example
};
void Bar()
{
    Foo* foo = GetFooFromSomewhere();

    SomeFuncUsingIntValA(foo->GetInt()); // accessing via dereference then function
    SomeFuncUsingIntValB(foo->GetInt()); // accessing via dereference then function
    SomeFuncUsingIntValC(foo->GetInt()); // accessing via dereference then function

    // Is this better?
    int val = foo->GetInt();
    SomeFuncUsingIntValA(val);
    SomeFuncUsingIntValB(val);
    SomeFuncUsingIntValC(val);

    ///////////////////////////////////////////////
    // And likewise with the . operator
    Foo fooDot(5);

    SomeFuncUsingIntValA(fooDot.GetInt()); // accessing via function
    SomeFuncUsingIntValB(fooDot.GetInt()); // accessing via function
    SomeFuncUsingIntValC(fooDot.GetInt()); // accessing via function

    // Is this better?
    int valDot = fooDot.GetInt();
    SomeFuncUsingIntValA(valDot);
    SomeFuncUsingIntValB(valDot);
    SomeFuncUsingIntValC(valDot);

    ///////////////////////////////////////////////
    // And lastly, a dot operator to a member, not a function
    SomeFuncUsingIntValA(fooDot.m_intVal); // accessing via member
    SomeFuncUsingIntValB(fooDot.m_intVal); // accessing via member
    SomeFuncUsingIntValC(fooDot.m_intVal); // accessing via member

    // Is this better?
    int valAsMember = fooDot.m_intVal;
    SomeFuncUsingIntValA(valAsMember);
    SomeFuncUsingIntValB(valAsMember);
    SomeFuncUsingIntValC(valAsMember);
}
OK, so I'll try to give an answer here.
Short version: you definitely don't need to do this.
Long version: you might need to do this.
Here it goes: in interpreted languages like JavaScript, these kinds of things can have a noticeable impact. In compiled languages like C++, not so much, to the point of not at all.
Most of the time you don't need to worry about these things, because an immense amount of resources has been poured into compiler optimization algorithms (and actual implementations), and the compiler will usually decide correctly what to do: allocate an extra register and save the result in order to reuse it, or recompute it every time and save that register space, etc.
There are instances where the compiler can't do this: when it can't prove that multiple calls produce the same result. Then it has no choice but to make all the calls.
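For instance, reusing the declarations from the question (a sketch): if the called functions might modify the object foo points to through some alias the compiler can't rule out, each foo->GetInt() has to be re-executed:

void Bar(Foo *foo) {
    SomeFuncUsingIntValA(foo->GetInt()); // might modify *foo via an alias...
    SomeFuncUsingIntValB(foo->GetInt()); // ...so this read must be repeated
}
// Caching the value in a local int both avoids the repeated read and
// documents that you want the value as it was before the first call.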
Now let's assume that the compiler makes the wrong choice and you, as a precaution, make the effort of micro-optimizing. Say you squeeze out a 10% performance increase (which is already a wildly optimistic figure for this kind of optimization) on that portion of code. But what do you know: your program spends only 1% of its time in that portion of code. The rest of the time is most likely spent in some hot loops and waiting for data fetches. So you spend a non-negligible amount of effort to hand-optimize the code, only to get a 0.1% improvement in total time, which won't even be observable given the external factors that vary the execution time by far more than that amount.
So don’t spend time with micro-optimizations in C++.
However, there are cases where you might need to do this, and even crazier things. But that comes only after properly profiling your code, and it is another discussion.
So worry about readability; don't worry about micro-optimizations.
The question is not really related to -> and . operators, but rather about repetitive expressions in general. Yes, it is true that most modern compilers are smart enough to optimize the code that evaluates the same expression repeatedly (assuming it has no observable side-effects).
However, using an explicit intermediate variable typically makes the program much more readable, since it explicitly exposes the fact that the same value is supposed to be used in all contexts. It makes it clear that it was your intent to use the same value everywhere.
If you repeat the same expression to generate that value again and again, this fact becomes much less obvious. Firstly, it is difficult to say at first sight whether the expressions are really identical (especially when they are long). Secondly, it is not obvious whether sequential evaluations of a seemingly identical expression produce identical results.
Finally, slicing long expressions into smaller ones by using intermediate variables can significantly simplify debugging the code in a step-by-step debugger, since it gives the user a much greater degree of control through the "step in" and "step over" commands.
It's certainly better in terms of readability and maintainability to have such a temporary variable.
In terms of performance, you shouldn't worry about such micro-optimization at this stage (premature optimization). Moreover, modern C++ compilers can optimize it anyway, so you really shouldn't worry about it.
Just a question for my own curiosity. I have heard many times that it's best to use the copy/destroy paradigm when writing a method. So if you have a method like this:
OtherClass MyClass::getObject() {
    OtherClass returnedObject;
    return returnedObject;
}
supposedly the compiler will optimize this by essentially inlining the method and constructing the object directly on the stack of the method that calls getObject. I'm wondering how that would work in a loop such as this:
for (int i = 0; i < 10; i++) {
    list.push_back(myClass.getObject());
}
would the compiler put 10 instances of OtherClass on the stack so it could inline this method and avoid the copy and destroy that would happen in unoptimized code? What about code like this:
while (!isDone) {
    list.push_back(myClass.getObject());
    // other logic which decides whether or not to set isDone
}
In this case the compiler couldn't possibly know how many times getObject will be called, so presumably it can't pre-allocate anything on the stack. My assumption, then, is that no inlining is done and every time the method is called I will pay the full cost of copying OtherClass?
I realize that all compilers are different, and that this depends on whether the compiler believes this code is worth optimizing. I'm speaking only in general terms: how will most compilers most likely respond? I'm curious how this sort of optimization is done.
for (int i = 0; i < 10; i++) {
    list.push_back(myClass.getObject());
}
would the compiler put 10 instances of OtherClass on the stack so it could inline this method and avoid the copy and destroy that would happen in unoptimized code?
It doesn't need to put 10 instances on the stack just to avoid the copy and destroy... if there's space for one object to be returned, with or without Return Value Optimisation, then it can reuse that space 10 times, each time copying from that same stack space into the new heap-allocated memory used by the list's push_back.
It would even be within the compiler's rights to allocate new memory and arrange for myClass.getObject() to construct the objects directly in that memory.
Further, if the optimiser chooses to unroll the loop, it could potentially call myClass.getObject() 10 times - even with some overlap or parallelism - if it can somehow convince itself that that produces the same overall result. In that situation it would indeed need space for 10 return objects, and again it's up to the compiler whether that's on the stack or, through some miraculously clever optimisation, directly in the heap memory.
In practice, I would expect compilers to need to copy from stack to heap - I doubt very much any mainstream compiler is clever enough to arrange direct construction in the heap memory. Loop unrolling and RVO are common optimisations, though. But even if both kick in, I'd expect each call to getObject to serially construct a result on the stack, which is then copied to the heap.
If you want to "know", write some code to test for your own compiler. You can have the constructor write out the "this" pointer value.
What about code like this:
while (!isDone) {
    list.push_back(myClass.getObject());
    // other logic which decides whether or not to set isDone
}
The more complex and less idiomatic the code is, the less likely the compiler writers have been able and bothered to optimise for it. Here, you're not even showing us a complexity level we can speculate on. Try it for your compiler and optimisation settings and see....
It depends on which version of which compiler on which OS.
Why not get your compiler to output its assembly, so you can take a look yourself?
gcc - http://www.delorie.com/djgpp/v2faq/faq8_20.html
visual studio - Viewing Assembly level code from Visual C++ project
In general, an optimizing compiler can make any changes whatsoever to your code so long as the resulting program's behavior isn't visibly changed. That includes inlining functions (or not), even if the function isn't marked inline by the programmer.
The only thing the compiler has to care about is the program's behavior. If the optimization keeps the program logic and data intact, the optimization is legal. What goes in (all possible program input) has to come out (all possible program output) the same way as without optimization.
Whether this particular optimization is possible (it surely is; whether it is an actual optimization is a different thing!) depends on the target platform's instruction set and whether it is feasible to implement.
In heavy loops, such as ones found in game applications, there could be many factors that decide what part of the loop body is executed (for example, a character object will be updated differently depending on its current state) and so instead of doing:
void my_loop_function(int dt) {
    if (conditionX && conditionY)
        doFoo();
    else
        doBar();
    ...
}
I am used to using a function pointer that points to a certain logic function corresponding to the character's current state, as in:
void (*updater)(int);

void something_happens() {
    updater = &doFoo;
}

void something_else_happens() {
    updater = &doBar;
}

void my_loop_function(int dt) {
    (*updater)(dt);
    ...
}
And in the case where I don't want to do anything, I define a dummy function and point to it when I need to:
void do_nothing(int dt) { }
Now what I'm really wondering is: am I obsessing over this needlessly? The example given above is of course simple; sometimes I need to check many variables to figure out which pieces of code to execute, so I figured that using these "state" function pointers would indeed be more optimal, and to me more natural, but a few people I'm dealing with heavily disagree.
So, is the gain from using a (virtual) function pointer worth it, instead of filling my loops with conditional statements to direct the logic?
Edit: to clarify how the pointer is being set, it's done through event handling on a per-object basis. When an event occurs and, say, that character has custom logic attached to it, it sets the updater pointer in that event handler until another event occurs which will change the flow once again.
Thank you
The function pointer approach lets you make the transitions asynchronous. Rather than just passing dt to the updater, pass the object as well. Now the updater can itself be responsible for the state transitions. This localizes the state-transition logic instead of globalizing it in one big ugly if ... else if ... else function.
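A sketch of that suggestion (Character, its fields, and the updater signature are all assumptions):

struct Character;
using Updater = void (*)(Character &, int);
void doFoo(Character &c, int dt);
void doBar(Character &c, int dt);

struct Character {
    Updater updater = &doFoo; // start in the "foo" state
    int health = 100;
};

void doFoo(Character &c, int dt) {
    c.health -= dt;          // ... normal update logic ...
    if (c.health <= 0)
        c.updater = &doBar;  // the updater performs its own transition
}

void doBar(Character &c, int dt) {
    (void)c; (void)dt;       // ... dead/idle logic ...
}

void my_loop_function(Character &c, int dt) {
    c.updater(c, dt);        // one indirect call; transitions live in the updaters
}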
As far as the cost of this indirection goes, do you care? You might care if your updaters are so extremely small that the cost of a dereference plus a function call overwhelms the cost of executing the updater code. If the updaters are of any complexity, that complexity is going to overwhelm the cost of this added flexibility.
I think I'll agree with the non-believers here. The money question in this case is: how is the pointer value going to be set?
If you can somehow index into a map and produce a pointer, then this approach might justify itself by reducing code complexity. However, what you have here is rather more like a state machine spread across several functions.
Consider that something_else_happens in practice will have to examine the previous value of the pointer before setting it to another value. The same goes for something_different_happens, etc. In effect you've scattered the logic for your state machine all over the place and made it difficult to follow.
Now what I'm really wondering is: am I obsessing about this needlessly?
If you haven't actually run your code, and found that it actually runs too slowly, then yes, I think you probably are worrying about performance too soon.
Herb Sutter and Andrei Alexandrescu in
C++ Coding Standards: 101 Rules, Guidelines, and Best Practices devote chapter 8 to this, called "Don’t optimize prematurely", and they summarise it well:
Spur not a willing horse (Latin proverb): Premature optimization is as addictive as it is unproductive. The first rule of optimization is: Don’t do it. The second rule of optimization (for experts only) is: Don’t do it yet. Measure twice, optimize once.
It's also worth reading chapter 9: "Don’t pessimize prematurely"
Testing a condition is:
fetch a value
compare (subtract)
jump if zero (or non-zero)
Performing an indirection is:
fetch an address
jump.
So it may be even more performant!
In fact you do the "compare" beforehand, in another place, to decide what to call. The result will be identical.
You did nothing more than build a dispatch system identical to the one the compiler uses when calling virtual functions.
It has been shown that avoiding virtual functions by implementing dispatch through switches doesn't improve performance on modern compilers.
The "don't use indirection / don't use virtual / don't use function pointer / don't dynamic cast etc." in most of the case are just myths based on historical limitations of early compiler and hardware architectures..
The performance difference will depend on the hardware and the compiler optimizer. Indirect calls can be very expensive on some machines, and very cheap on others. And really good compilers may be able to optimize even indirect calls, based on profiler output. Until you've actually benchmarked both variants, on your actual target hardware and with the compiler and compiler options you use in your final release code, it's impossible to say.
If the indirect calls do end up being too expensive, you can still hoist the tests out of the loop, by either setting an enum and using a switch in the loop, or by implementing the loop for each combination of settings and selecting once at the beginning. (If the functions you point to implement the complete loop, this will almost certainly be faster than testing the condition each time through the loop, even if indirection is expensive.)
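A sketch of that second option (the loop and function names are invented; doFoo/doBar come from the question, here assumed to take dt):

void doFoo(int dt);
void doBar(int dt);

void loopWithFoo(int frames, int dt) {
    for (int i = 0; i < frames; ++i)
        doFoo(dt); // specialised body, no per-iteration test
}

void loopWithBar(int frames, int dt) {
    for (int i = 0; i < frames; ++i)
        doBar(dt);
}

void run(int frames, int dt, bool fooState) {
    if (fooState) // the state is tested once, outside the loop
        loopWithFoo(frames, dt);
    else
        loopWithBar(frames, dt);
}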