Pointer dereferencing overhead vs branching / conditional statements


In heavy loops, such as ones found in game applications, there could be many factors that decide what part of the loop body is executed (for example, a character object will be updated differently depending on its current state) and so instead of doing:
void my_loop_function(int dt) {
    if (conditionX && conditionY)
        doFoo();
    else
        doBar();
    ...
}
I am used to using a function pointer that points to a certain logic function corresponding to the character's current state, as in:
void (*updater)(int);
void something_happens() {
    updater = &doFoo;
}
void something_else_happens() {
    updater = &doBar;
}
void my_loop_function(int dt) {
    (*updater)(dt);
    ...
}
And in the case where I don't want to do anything, I define a dummy function and point to it when I need to:
void do_nothing(int dt) { }
Now what I'm really wondering is: am I obsessing about this needlessly? The example given above is of course simple; sometimes I need to check many variables to figure out which pieces of code I'll need to execute, and so I figured that using these "state" function pointers would be more efficient and, to me, more natural, but a few people I work with strongly disagree.
So, is the gain from using a (virtual) function pointer worth it, compared to filling my loops with conditional statements to direct the logic?
Edit: to clarify how the pointer is being set, it's done through event handling on a per-object basis. When an event occurs and, say, that character has custom logic attached to it, it sets the updater pointer in that event handler until another event occurs which will change the flow once again.
Thank you

The function pointer approach lets you make the transitions asynchronous. Rather than just passing dt to the updater, pass the object as well. Now the updater can itself be responsible for the state transitions. This localizes the state transition logic instead of globalizing it in one big ugly if ... else if ... else if ... function.
As far as the cost of this indirection, do you care? You might care if your updaters are so extremely small that the cost of a dereference plus a function call overwhelms the cost of executing the updater code. If the updaters are of any complexity, that complexity is going to overwhelm the cost of this added flexibility.
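As a rough sketch of "pass the object as well": the Character type, its stamina field, and the two updater functions below are made-up names for illustration, not anything from the question. Each updater decides its own transition, so the loop body never inspects the state.

```cpp
#include <string>

// Hypothetical character type; all names here are illustrative only.
struct Character;
using Updater = void (*)(Character&, int dt);

struct Character {
    Updater updater = nullptr;  // current state's logic
    int stamina = 3;
    std::string log;            // records which updater ran
};

void update_resting(Character& c, int dt);

void update_running(Character& c, int dt) {
    c.log += 'R';
    c.stamina -= dt;
    if (c.stamina <= 0)
        c.updater = &update_resting;   // transition decided locally
}

void update_resting(Character& c, int dt) {
    c.log += 'Z';
    c.stamina += dt;
    if (c.stamina >= 2)
        c.updater = &update_running;   // transition decided locally
}

// The loop body stays free of state checks.
void tick(Character& c, int dt) {
    c.updater(c, dt);
}
```

Starting from update_running and calling tick(c, 1) repeatedly, the character runs until its stamina hits zero, rests until it recovers, then runs again, without any if/else chain in the loop itself.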

I think I'll agree with the non-believers here. The money question in this case is: how is the pointer value going to be set?
If you can somehow index into a map and produce a pointer, then this approach might justify itself through reducing code complexity. However, what you have here is rather more like a state machine spread across several functions.
Consider that something_else_happens in practice will have to examine the previous value of the pointer before setting it to another value. The same goes for something_different_happens, etc. In effect you've scattered the logic for your state machine all over the place and made it difficult to follow.

Now what I'm really wondering is: am I obsessing about this needlessly?
If you haven't actually run your code, and found that it actually runs too slowly, then yes, I think you probably are worrying about performance too soon.
Herb Sutter and Andrei Alexandrescu, in C++ Coding Standards: 101 Rules, Guidelines, and Best Practices, devote item 8 to this, titled "Don’t optimize prematurely", and they summarise it well:
Spur not a willing horse (Latin proverb): Premature optimization is as addictive as it is unproductive. The first rule of optimization is: Don’t do it. The second rule of optimization (for experts only) is: Don’t do it yet. Measure twice, optimize once.
It's also worth reading item 9: "Don’t pessimize prematurely".

Testing a condition is:
fetch a value
compare (subtract)
jump if zero (or non-zero)
Performing an indirection is:
fetch an address
jump
It may even be faster!
In fact, you do the "compare" beforehand, in another place, to decide what to call. The result will be identical.
All you did was build a dispatch system like the one the compiler generates when calling virtual functions.
Replacing virtual functions with switch-based dispatch generally doesn't improve performance with modern compilers.
The "don't use indirection / don't use virtual / don't use function pointers / don't use dynamic_cast" rules are, in most cases, just myths based on historical limitations of early compilers and hardware architectures.

The performance difference will depend on the hardware and the compiler
optimizer. Indirect calls can be very expensive on some machines, and
very cheap on others. And really good compilers may be able to optimize
even indirect calls, based on profiler output. Until you've actually
benchmarked both variants, on your actual target hardware and with the
compiler and compiler options you use in your final release code, it's
impossible to say.
If the indirect calls do end up being too expensive, you can still hoist
the tests out of the loop, by either setting an enum, and using a
switch in the loop, or by implementing the loop for each combination
of settings, and selecting once at the beginning. (If the functions you
point to implement the complete loop, this will almost certainly be
faster than testing the condition each time through the loop, even if
indirection is expensive.)
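The two fallbacks described above might look roughly like this. Mode, run_switch and run_hoisted are hypothetical names chosen for the sketch, and the trivial add/multiply bodies stand in for real per-iteration work.

```cpp
#include <vector>

// Hypothetical mode selector; the names are illustrative only.
enum class Mode { Add, Multiply };

// Variant 1: set an enum once, then switch on it inside the loop.
int run_switch(const std::vector<int>& v, Mode m) {
    int acc = (m == Mode::Add) ? 0 : 1;
    for (int x : v) {
        switch (m) {
            case Mode::Add:      acc += x; break;
            case Mode::Multiply: acc *= x; break;
        }
    }
    return acc;
}

// Variant 2: one complete loop per setting, selected once up front.
int sum_loop(const std::vector<int>& v) {
    int acc = 0;
    for (int x : v) acc += x;
    return acc;
}

int product_loop(const std::vector<int>& v) {
    int acc = 1;
    for (int x : v) acc *= x;
    return acc;
}

int run_hoisted(const std::vector<int>& v, Mode m) {
    return (m == Mode::Add) ? sum_loop(v) : product_loop(v);
}
```

Both variants compute the same result; which one is faster on a given target is exactly the kind of thing that has to be benchmarked rather than assumed.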

Related

Will modern c++ compiler optimize immutable temporary variable?

For example I have a code like this:
void func(const QString& str)
{
    // str is const, so replace() is called on a copy
    QString s = QString(str).replace(QRegExp("[abc]+"), " ");
    ......
}
Will the compiler optimize the temporary QRegExp("[abc]+") so that it is constructed just once, instead of being constructed each time func is invoked? Or, in other words, do I need to rewrite the code for performance like this:
void func(const QString& str)
{
    static const QRegExp sc_re("[abc]+");
    QString s = QString(str).replace(sc_re, " ");
    ......
}
That is, make the QRegExp a static const variable.

Will the compiler optimize the temporary QRegExp("[abc]+") so that it is constructed just once, instead of being constructed each time func is invoked?
You are assuming that each invocation of func will construct an identical QRegExp object, but how do you know that? How do you know, for example, that these objects do not contain a serial number, an integer member that is set to the number of QRegExp objects previously constructed? If such a serial number was being used, it would be wrong for the compiler to construct your temporary variable just once.
OK, we can reasonably guess that nothing like that is going on. The point, though, is that we are guessing, and the compiler is not allowed to guess. So a prerequisite for the compiler considering such an optimization would be that the definition of the constructor is available (which is an implementation detail of that class, something you should not make your code dependent on).
If the constructor's definition is available, and if that definition provably produces the same results given the same input (and probably some other technical restrictions that slip my mind at the moment), then a compiler would be allowed to make this optimization.
I do not know if any compilers choose to provide this sort of optimization when it would be both allowed and beneficial (another assumption you've made). Performance testing of the two candidates with and without optimizations enabled should reveal if your particular compiler is likely taking advantage of this.
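The serial-number scenario can be made concrete with a small sketch. Pattern here is a hypothetical stand-in written for this example, not a Qt class, and it deliberately gives its constructor an observable side effect. With such a class, constructing once versus constructing per call produces different program behaviour, which is why the compiler cannot make the switch on its own.

```cpp
// Hypothetical stand-in for a class like QRegExp whose constructor has an
// observable side effect: every object records a serial number.
struct Pattern {
    static int constructed;   // how many Pattern objects were ever built
    int serial;
    explicit Pattern(const char*) : serial(++constructed) {}
};
int Pattern::constructed = 0;

// Builds a fresh temporary on every call.
int use_temporary() {
    return Pattern("[abc]+").serial;
}

// Builds the object once, on the first call (function-local static).
int use_static() {
    static const Pattern p("[abc]+");
    return p.serial;
}
```

Calling use_temporary() twice bumps the counter twice, while calling use_static() any number of times bumps it only once, so the two versions are not interchangeable from the compiler's point of view.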
Or in other words, do I need to reimplement the coding for performance like this:
You almost never need to re-implement for performance. (One exception would be if your code is so inefficient it would take centuries to finish. I'm pretty sure we're not in that ballpark.) A better question is "should". I'll go with that.
In this specific case I would guess "no, that looks like premature optimization". However, that is just a guess, so I'll proceed to general guidelines that you can apply.
You should re-implement for performance only if:
1) the performance gain is noticeable to an end user, or
2) the new code is easier for a programmer to read and understand.
In other cases, rely on the compiler to make appropriate optimizations.
In your case, I see the variable name sc_re and think "what is that?" So point 2 is out. That leaves the question of a noticeable performance gain. This usually is not something one can determine by simply asking around. Typically, it involves performance testing, probably of at least two types. One test would time the two candidates in an artificial heavy loop to see how large the performance gain is (if there is one at all). The other test would profile your actual program to see if this code is called often enough for the gain to be noticed by an end user. A good third test would be to give the actual program to an end user and see if they notice the difference.
Of these tests, profiling might be the most productive use of your time. (Programmers are notoriously bad at identifying true performance roadblocks without the aid of a profiler.) If you spend 2 milliseconds in this function every 5 minutes, why spend time trying to improve that? On the other hand, if you spend 1 second in this function each time it is called, the profiler might tell you whether or not this constructor is the main culprit.

C++, do functions know previously computed function values or do they have to recompute them?

Let's say I have a very costly function that checks if an object has a certain property. Another function would then, depending on whether the object has the property, do different things.
If I have previously checked for the property, would it be recomputed by the second function, or is it known?
I'm thinking of something like:
bool check_property(const Object& obj) {
    // very costly operations...
}
void do_something(const Object& obj) {
    if (check_property(obj)) { /* do thing */ }
    else { /* do different thing */ }
}
Would the if in do_something recompute check_property?

There are several factors that have to come together for the compiler to avoid recomputing the function's result:
The compiler has to know which input values the function's result depends on. This knowledge is very difficult to extract from the code in the general case. In some implementations you can help the compiler by using compiler-specific means to declare your function as "pure" or "const" (GCC function attributes).
The compiler has to make sure that the above input values did not change since the previous call to the same function. This might be very easy in some specific cases, but is also very difficult in the general case.
The compiler has to have the result of previous computation readily available. Normally, compilers do not deliberately "cache" such results in some dedicated storage for future reuse. The optimization in question is typically applied only when you make multiple calls to the same function in "close proximity" to each other, meaning that the previous result is easy to keep till the moment of the next call.
So, the optimization in question is certainly possible. But it is something you should expect to see in simple and very localized cases, like calling sqrt(x) several times in a row for the same value of x (in the same expression, in the same cycle and such). But for more complicated functions it is typically going to be your responsibility to either somehow avoid making multiple calls to the same expensive function, or maybe memoize the results if you believe it can benefit your code.
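If memoizing by hand turns out to be worthwhile, a minimal sketch could look like the following. expensive_check and cached_check are hypothetical names, the even/odd test merely stands in for the costly work, and the sketch assumes single-threaded use and inputs whose result never changes (no invalidation).

```cpp
#include <unordered_map>

// Counts how often the expensive body actually runs (for illustration).
int g_expensive_calls = 0;

// Hypothetical costly, side-effect-free check on an integer id.
bool expensive_check(int id) {
    ++g_expensive_calls;
    return id % 2 == 0;
}

// Memoizing wrapper: each distinct id is computed at most once.
// Assumption: not thread-safe, and the cache is never invalidated.
bool cached_check(int id) {
    static std::unordered_map<int, bool> cache;
    auto it = cache.find(id);
    if (it != cache.end())
        return it->second;           // reuse the earlier result
    bool result = expensive_check(id);
    cache.emplace(id, result);
    return result;
}
```

Repeated calls to cached_check with the same id run the expensive body only the first time; whether the lookup is cheaper than the recomputation depends on how costly the real function is.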

Unless the compiler can prove that check_property has no side effects and that all the data it depends on is unchanged, it is not allowed to remove the call. For all practical purposes, unless the function body is visible in the current translation unit, the function is fairly trivial, and the multiple calls happen in the same function, calling it again will execute its code again. I don't know of any compiler that automatically establishes a cross-call cache, because doing so is not trivial at all.
If you need to cache the computed values, in general you will have to do it yourself; keep in mind that it's not always trivial - generally the ugly beasts to tackle are cache invalidation (how do I know that the data used to calculate the value didn't change from the last time I calculated it? how do I avoid the cache size getting out of hand?) and multithreading concerns (is this code going to be called from multiple threads? if so, I have to synchronize the access to the cache, possibly adding coupling between unrelated threads and, in extreme cases, killing the efficiency of the cache itself).

To answer your question: yes, it will rerun it. If you want to make sure the check doesn't run again every time you call do_something, compute the result once and pass it in as an argument:
bool check_property(const Object& obj) {
    // very costly operations...
    return true;
}
void do_something(const Object& obj, bool has_property) {
    if (has_property) { /* do thing */ }
    else { /* do different thing */ }
}
int main() {
    Object obj;
    bool has_property = check_property(obj); // computed once
    do_something(obj, has_property);
}
There are of course multiple ways of doing this, and this might not fit your criteria, but it is one possible way of doing it!
I originally suggested storing the flag in a class, but then realized that, unlike Java, not everything in C++ lives in a class. Instead you can just pass the value as an argument to the function itself, so I have edited my code.

Understanding cost of multiple . and -> operator use?

Out of habit, when accessing values via . or ->, I assign them to variables any time the value is going to be used more than once. My understanding is that in scripting languages like ActionScript, this is pretty important. However, in C/C++, I'm wondering if this is a meaningless chore; am I wasting effort that the compiler is going to handle for me, or am I exercising a good practice, and why?
struct Foo
{
public:
    Foo(int val) { m_intVal = val; }
    int GetInt() { return m_intVal; }
    int m_intVal; // not private for sake of last example
};
void Bar()
{
    Foo* foo = GetFooFromSomewhere();
    SomeFuncUsingIntValA(foo->GetInt()); // accessing via dereference then function
    SomeFuncUsingIntValB(foo->GetInt()); // accessing via dereference then function
    SomeFuncUsingIntValC(foo->GetInt()); // accessing via dereference then function
    // Is this better?
    int val = foo->GetInt();
    SomeFuncUsingIntValA(val);
    SomeFuncUsingIntValB(val);
    SomeFuncUsingIntValC(val);
    ///////////////////////////////////////////////
    // And likewise with . operator
    Foo fooDot(5);
    SomeFuncUsingIntValA(fooDot.GetInt()); // accessing via function
    SomeFuncUsingIntValB(fooDot.GetInt()); // accessing via function
    SomeFuncUsingIntValC(fooDot.GetInt()); // accessing via function
    // Is this better?
    int valDot = fooDot.GetInt();
    SomeFuncUsingIntValA(valDot);
    SomeFuncUsingIntValB(valDot);
    SomeFuncUsingIntValC(valDot);
    ///////////////////////////////////////////////
    // And lastly, a dot operator to a member, not a function
    SomeFuncUsingIntValA(fooDot.m_intVal); // accessing via member
    SomeFuncUsingIntValB(fooDot.m_intVal); // accessing via member
    SomeFuncUsingIntValC(fooDot.m_intVal); // accessing via member
    // Is this better?
    int valAsMember = fooDot.m_intVal;
    SomeFuncUsingIntValA(valAsMember);
    SomeFuncUsingIntValB(valAsMember);
    SomeFuncUsingIntValC(valAsMember);
}

OK, so I'll try to give an answer here.
Short version: you definitely don’t need to do this.
Long version: you might need to do this.
So here it goes: in interpreted languages like JavaScript, these kinds of things might have a noticeable impact. In compiled languages like C++, not so much, to the point of not at all.
Most of the time you don’t need to worry about these things, because an immense amount of resources has been poured into compiler optimization algorithms (and actual implementations), so the compiler will correctly decide what to do: allocate an extra register and save the result in order to reuse it, or recompute it every time and save that register, etc.
There are instances where the compiler can’t do this. That is when it can’t prove multiple calls produce the same result. Then it has no choice but to make all the calls.
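A small sketch of such a case, under the assumption that SomeOpaqueCall is defined somewhere the optimizer cannot see into (it is written inline here only so the example is self-contained). Because the call may change the object, re-reading and caching are not equivalent, and the compiler has to keep both GetInt() calls.

```cpp
struct Foo {
    int m_intVal;
    int GetInt() const { return m_intVal; }
};

// Hypothetical global object that other code may modify.
Foo g_foo{1};

// Imagine this lives in another translation unit, so the optimizer
// cannot see that it modifies g_foo.
void SomeOpaqueCall() {
    ++g_foo.m_intVal;
}

// Re-reads the value after the call: observes the update.
int sum_rereading(const Foo* foo) {
    int total = foo->GetInt();
    SomeOpaqueCall();
    total += foo->GetInt();
    return total;
}

// Caches the value once: the later update is not observed.
int sum_cached(const Foo* foo) {
    const int val = foo->GetInt();
    SomeOpaqueCall();
    return val + val;
}
```

Here the hand-written temporary is not just an optimization but a change in meaning, so it should be chosen because it expresses the intended behaviour.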
Now let’s assume that the compiler makes the wrong choice and you, as a precaution, make the effort of micro-optimizing. You make the optimization and you squeeze out a 10% performance increase (which is already a very optimistic figure for this kind of optimization) on that portion of code. But what do you know, your code spends only 1% of its time in that portion of code. The rest of the time is most likely spent in some hot loops and waiting for data fetches. So you spend a non-negligible amount of effort hand-optimizing the code only to get a 0.1% performance increase in total time, which won’t even be observable, because external factors vary the execution time by far more than that amount.
So don’t spend time with micro-optimizations in C++.
However there are cases where you might need to do this and even crazier things. But this is only after properly profiling your code and this is another discussion.
So worry about readability, don’t worry about micro–optimizations.

The question is not really related to -> and . operators, but rather about repetitive expressions in general. Yes, it is true that most modern compilers are smart enough to optimize the code that evaluates the same expression repeatedly (assuming it has no observable side-effects).
However, using an explicit intermediate variable typically makes the program much more readable, since it exposes the fact that it was your intent to use the same value in all contexts.
If you repeat the same expression to generate that value again and again, this fact becomes much less obvious. Firstly, it is difficult to say at first sight whether the expressions are really identical (especially when they are long). Secondly, it is not obvious whether sequential evaluations of seemingly the same expression produce identical results.
Finally, slicing long expressions into smaller ones by using intermediate variables can significantly simplify debugging the code in a step-by-step debugger, since it gives the user a much greater degree of control through the "step in" and "step over" commands.

It's certainly better in terms of readability and maintainability to have such a temporary variable.
In terms of performance, you shouldn't worry about such micro-optimization at this stage (premature optimization). Moreover, modern C++ compilers can optimize it anyway, so you really shouldn't worry about it.

How expensive are NULL pointer arguments?

In implementing a menu on an embedded system in C(++) (AVR-GCC), I ended up with void function pointers that take arguments and usually make use of them.
// void function prototype
void (*auxFunc)(char *);
In some cases (in fact quite a few), the function actually doesn't need the argument, so I would do something like:
if (something) doAuxFunc(NULL);
I know I could just overload to a different function type, but I'm actually trying not to do this as I am instantiating multiple objects and want to keep them light.
Is calling multiple functions with NULL pointers (when they are intended for an actual pointer) worse than implementing many more function prototypes?

Checking for NULLs is a very small overhead even on a microcontroller - comparison against 0 is supposed to be lightning fast. If you overload several functions, you'll crucify readability for (a very slight) improvement in performance. Just let GCC's optimizer do its stuff, it's pretty good at it :)

Look at the disassembly: it should be generating a null (zero) to pass as the first argument, which burns either a register or a stack location. If it burns a register, it may cost you a push and a pop if the calling function is starved for registers. (Just making a function call may cost you pushes and pops if the function is starved for registers, in order to implement the calling convention.)
So there is likely a cost, but it may not be enough of a cost to change the way you do things.

Checking for 0 is really cheap; overloading is even cheaper, since which function to choose is decided at compile time.
But if you think that your interfaces get too complicated with overloading, and your function is small, you should declare it inline and put it in a header. Checking for 0 can then easily be optimized away by any decent modern compiler.
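A minimal sketch of the inline-in-a-header idea: auxLength and noArgCase are hypothetical names, and returning the string length stands in for whatever the real menu callback does with its argument.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical menu helper; in a real project this would sit in a header.
// A null argument means "no argument supplied".
inline std::size_t auxLength(const char* arg) {
    if (arg == nullptr)       // cheap comparison against zero
        return 0;
    return std::strlen(arg);
}

// When the call site passes a literal null and the call is inlined,
// the optimizer can typically fold the check away.
inline std::size_t noArgCase() {
    return auxLength(nullptr);
}
```

Whether the check really disappears depends on the compiler and optimization level, so the disassembly is still the place to confirm it. Note also that a call made through a function pointer usually cannot be inlined, so this applies to direct calls.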

I think the "tradeoff" is ridiculously low for each approach but this is the time to do benchmarks for yourself. If you do so, please post some results :)

Virtual Function Compared to Pointer Casting

The current version of some code I'm using utilises a slightly odd way of achieving something which I think could be achieved with polymorphism. More concretely, we currently use something like
for(int i=0; i<CObjList.size(); ++i)
{
    CObj* W = CObjList[i];
    if( W->type == someTypeA )
    {
        // do some things which also involve casts such as
        // ((SomeClassA*) W->objectptr)->someFieldA
    }
    else if( W->type == someTypeB )
    {
        // do some things which also involve casting such as
        // ((SomeClassB*) W->objectptr)->someFieldB
    }
}
To clarify: each object W contains a void *objectptr, that is to say, a pointer to an arbitrary location. The field W->type keeps track of what type of object objectptr points at, so that inside our if/else statements we can cast W->objectptr to the correct type and use its fields.
However, this seems inherently bad from a code design standpoint for several reasons:
We have no guarantee that the object pointed to by W->objectptr actually matches what is said in W->type, so the cast is inherently unsafe.
Every time we wish to add another type we must add another else if statement and ensure W->type is set correctly.
It seems to me this would be much better solved with something like
class CObj
{
public:
    virtual void doSomething(/* some params */)=0;
};
class SomeClassA : public CObj
{
public:
    virtual void doSomething(/* some params */);
    int someFieldA;
};
class SomeClassB : public CObj
{
public:
    virtual void doSomething(/* some params */);
    int someFieldB;
};
// sometime later...
for(int i=0; i<CObjList.size(); ++i)
{
    CObj* W = CObjList[i];
    W->doSomething(/* some params */);
}
That said, there is the proviso that performance is important in this setting. This code will be called from a (relatively) tight loop.
My question is then: is the added complexity of a few vtable lookups outweighed by the improved code design and extensibility, and is this likely to affect performance a lot?
EDIT: It occurs to me that accessing the fields through a pointer in this way could be as bad as vtable lookups anyway due to cache misses etc. Any thoughts on this?
EDIT 2: Also, I forgot to mention (and I know it's a bit off the original topic) that inside the if statements there are many calls to member functions of the surrounding class. How would you design the structure so as to be able to call these from inside doSomething()?

I'm going to answer specifically on the performance angle, because I work in a perf-critical environment and a while ago I happened to run measurements on a similar case to work out the fastest solution.
If you are on an x86, PPC, or ARM processor, you want virtual functions in this situation. The performance cost of calling a virtual function is mostly the pipeline bubble induced by mispredicting an indirect branch. Because the instruction fetch stage of the CPU can't know where the computed jmp goes, it can't start fetching bytes from the target address until the branch executes, and thus you have a stall in the pipeline corresponding to the number of stages between the first fetch stage and the branch retire. (On the PPC I know best, that's something like 25 cycles.)
You also have the latency of loading the vtable pointer, but this is often hidden by instruction reordering (the compiler moves the load instruction so it starts several cycles before you actually need the result and the CPU does other work while the data cache sends you its electrons.)
With the if-cascade approach you instead have some number n of direct, conditional branches — where the target is known at compile time, but whether the jump is taken is determined at runtime. (ie, a jump-on-equal opcode.) In this case the CPU will make a guess (predict) at whether each branch is taken or not, and start fetching instructions accordingly. So, you will only have a bubble if the CPU guesses wrong. Since you are presumably calling this function with different input each time, it's going to mispredict at least one of these branches, and you'll have the exact same bubble that you would with virtuals. In fact, you'll have a whole lot more bubbles — one per if() conditional!
With virtual functions, there's also the risk of an additional data cache miss on loading the vtable, and an icache miss on the jump target. If this function is in a tight loop, then presumably you'll be looking up and calling the same subroutines a lot, and thus the vtable and function bodies will probably still be in cache. You could measure that if you wanted to be really sure.

Use virtual functions, this hypothetical optimization means nothing. What matters is code readability, maintainability and quality.
Optimize later with the aid of a profiler if you really need to tune hot spots. Making your code unmaintainable with that kind of crap is a road to failure.
Also, virtual functions will help you do unit tests, mock interfaces, etc.
Programming is about managing complexity....

My question is then: is the added complexity of a few vtable lookups outweighed by the improved code design and extensibility, and is this likely to affect performance a lot?
C++ compilers should be able to implement virtual functions very efficiently, so I don't think there's a downside in using them. (And certainly a huge maintainability/readability benefit!) But you should measure to make sure.
The way they are typically implemented is that each object has a vtable pointer. (multiple pointers in case of multiple inheritance, but let's forget that for now) This has the following relative costs over non-virtual functions.
data space: one pointer per object
data space: one vtable per class (not per object!)
time: worst case = two memory reads per function call (1 to get the vtable address, 1 to get the function address within the vtable). The offset in the vtable is known at compile time, because you know which function you're calling. There are no extra jumps.
Compare this with the costs of the non-OOP approach your existing software has.
data space: one type ID per object
code space: one if/else tree or switch statement each time you wish to call a function dependent on the object type
time: having to evaluate the if/else tree or switch statement.
I'd vote for the virtual function approach as actually being faster than the non-OOP approach, because it eliminates the need to take the time and figure out what type of object it is.
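The cost list above can be pictured with a hand-rolled version of what a compiler typically generates. This is only an illustrative model: the actual vtable layout is implementation-defined, and VTable, Obj and call are names made up for the sketch.

```cpp
// Hand-rolled picture of virtual dispatch; real compilers generate this
// for you and the exact layout is implementation-defined.
struct Obj;

struct VTable {
    int (*doSomething)(const Obj&);   // slot at a compile-time-known offset
};

struct Obj {
    const VTable* vptr;   // data space: one pointer per object
    int field;
};

int doSomethingA(const Obj& o) { return o.field + 1; }
int doSomethingB(const Obj& o) { return o.field * 2; }

// Data space: one table per "class", shared by all of its objects.
const VTable kTableA{&doSomethingA};
const VTable kTableB{&doSomethingB};

int call(const Obj& o) {
    // read 1: o.vptr; read 2: the slot in the table; then an indirect call
    return o.vptr->doSomething(o);
}
```

The call site contains no comparison against a type ID at all, which is the point made above about not having to work out what type of object it is.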

I had some experience with some largish (1M+ line I think) scientific computation code that was using a similar type based switch construct. They refactored to a properly polymorphic based approach and got a significant speedup. Exactly the opposite of what they expected!
Turned out the compiler was better able to optimise some things in that structure.
However this was a long time ago (8 years or so) .. so who knows what modern compilers will do. Don't guess - profile it.

As piotr says the right answer is probably virtual functions. You'll have to test.
But to address your concern about the casts:
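Never use C-style casts in a C++ program; use static_cast<>, dynamic_cast<>, etc.
In your specific case, use dynamic_cast<> (this requires a pointer or reference to a polymorphic base class rather than a void*). At least then you will get a null pointer, or a std::bad_cast exception when casting references, if the types are not properly related, which is better than a wild crash.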
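A short sketch of both forms of dynamic_cast. Base is a hypothetical polymorphic base class introduced for the example, since dynamic_cast cannot start from the void* used in the question.

```cpp
#include <typeinfo>   // std::bad_cast

// Hypothetical polymorphic base; dynamic_cast needs at least one virtual.
struct Base { virtual ~Base() = default; };
struct SomeClassA : Base { int someFieldA = 1; };
struct SomeClassB : Base { int someFieldB = 2; };

// Pointer form: yields nullptr on a type mismatch.
int fieldA_or(Base* p, int fallback) {
    if (auto* a = dynamic_cast<SomeClassA*>(p))
        return a->someFieldA;
    return fallback;
}

// Reference form: throws std::bad_cast on a type mismatch.
int fieldA_checked(Base& b) {
    return dynamic_cast<SomeClassA&>(b).someFieldA;
}
```

The pointer form suits code that expects mismatches and wants to branch on them; the reference form suits code where a mismatch is a bug that should be reported loudly.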

CRTP would be a great idea for this kind of case.
Edit: In your case,
template<class T>
class CObj
{
public:
    void doSomething(/* some params */)
    {
        // forwards to the derived class's implementation
        static_cast<T*>(this)->doSomething(/* some params */);
    }
};
class SomeClassA : public CObj<SomeClassA>
{
public:
    void doSomething(/* some params */);
    int someFieldA;
};
class SomeClassB : public CObj<SomeClassB>
{
public:
    void doSomething(/* some params */);
    int someFieldB;
};
Now you may have to write your loop code differently to accommodate objects of the different CObj<T> types, since they no longer share a common base class.
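One way the loop could be restructured is to keep one container per concrete type and run a templated loop over each. This is a sketch under assumptions: processAll is a hypothetical helper, the derived hook is named doSomethingImpl (a different name from the code above, chosen so that a missing override is a compile error rather than endless recursion), and returning the field value stands in for real work.

```cpp
#include <vector>

template <class T>
class CObj {
public:
    // Statically dispatches to the derived class; no vtable involved.
    int doSomething() { return static_cast<T*>(this)->doSomethingImpl(); }
};

class SomeClassA : public CObj<SomeClassA> {
public:
    int someFieldA = 1;
    int doSomethingImpl() { return someFieldA; }
};

class SomeClassB : public CObj<SomeClassB> {
public:
    int someFieldB = 20;
    int doSomethingImpl() { return someFieldB; }
};

// One homogeneous container per concrete type, one loop instantiation each.
template <class T>
int processAll(std::vector<T>& objs) {
    int total = 0;
    for (auto& o : objs)
        total += o.doSomething();
    return total;
}
```

The price of the static dispatch is that objects of different types can no longer sit in a single list, so the original ordering across types is lost unless it is tracked separately.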
