Is there any technology to "cache" the result of a branch choice? - c++

My code has a regular pattern: once an if statement evaluates to true, it stays true for a while, and once it changes to false, it stays false for a while. Since performance matters in this code, I want to make branch prediction more efficient.
Currently, what I have tried is to write two versions of this if statement, one optimized with "likely" and the other with "unlikely", and to use a function pointer to record which one to use. But since a function pointer also disrupts the pipeline, the benchmark shows no difference from a plain if statement. So I'm curious whether there is any technique to let the CPU "remember" the last outcome of this if statement.
Or, do I really need to care about this?

If it stays the same for a while, the branch predictor will figure that out pretty quickly. That's why sorting an input sometimes makes code run significantly faster; the random unsorted data keeps changing the test result back and forth with no pattern the branch predictor can use, but with sorted data, it has long runs where the branch is always taken or always not taken, and that's the easiest case for branch predictors to handle.
Don't overthink this; let the branch predictor do its job. You don't need to care about it.
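To make the sorted-versus-unsorted point concrete, here is a minimal sketch. The function names and the simple generator are made up for illustration; they are not from the question. The same unannotated branch runs over scrambled data and over a sorted copy: the result is identical, but in the sorted case the branch outcome flips only once, which is the long-run pattern a predictor handles well.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Counts elements >= threshold using an ordinary, unannotated branch.
std::size_t count_at_least(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        if (v >= threshold) {  // on sorted input the outcome changes only once
            ++count;
        }
    }
    return count;
}

// Builds a deterministic, scrambled sequence of values in [0, 255].
// The generator is a simple LCG chosen for illustration, not for quality.
std::vector<int> make_scrambled(std::size_t n) {
    std::vector<int> values(n);
    unsigned state = 12345u;
    for (int& v : values) {
        state = state * 1103515245u + 12345u;
        v = static_cast<int>((state >> 16) & 0xFFu);
    }
    return values;
}
```

Timing count_at_least on both inputs is the usual way to see the effect, though an optimizing compiler may replace the branch with branch-free code, in which case the difference can disappear.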

Related

Finding which code segment is faster than the other

Say that we have two C++ code segments, for doing the same task. How can we determine which code will run faster?
As an example, let's say there is this global array "some_struct_type numbers[]". Inside a function, I can read a location of this array in two ways (I do not want to alter the content of the array):
some_struct_type val = numbers[i];
some_struct_type* val = &numbers[i];
I assume the second one is faster, but I can't measure the time to make sure because the difference will be negligible.
So in this type of situation, how do I figure out which code segment runs faster? Is there a way to compile a single line of code or a set of lines and view how many assembly instructions there are?
I would appreciate your thoughts on this matter.
The basics are to run the piece of code so many times that it takes a few seconds at least to complete, and measure the time.
But it's hard, very hard, to get any meaningful figures this way, for many reasons:
Today's compilers are very good at optimizing code, but the optimizations depend on the context. It often does not make sense to look at a single line and try to optimize it. When the same line appears in a different context, the optimizations applied may be different.
Short pieces of code can be much faster than the surrounding looping code, so the loop overhead may dominate what you measure.
The compiler is not the only thing that optimizes: the processor has a cache and an instruction pipeline, and it tries to predict branches. A value that has been read before will be read much faster the next time, for example.
...
Because of this, it's usually better to leave the code in its place in your program, and use a profiling tool to see which parts of your code use the most processing resources. Then, you can change these parts and profile again.
While writing new code, prefer readable code to seemingly optimal code. Choose the right algorithm; this also depends on your input sizes. For example, insertion sort can be faster than quicksort if the input is very small. But unless your input is special, don't write your own sorting code; use the generally available libraries. And don't optimize prematurely.
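The "run it many times and measure" approach from the start of this answer could be sketched roughly as follows. The helper name average_nanoseconds is an assumption made for this example. It uses std::chrono::steady_clock and is subject to all the caveats listed above; in particular, an optimizer may remove work whose result is never used.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs `work` `iterations` times and returns the average wall-clock time
// per call in nanoseconds. Loop overhead is included in the result.
template <typename Work>
double average_nanoseconds(Work work, std::size_t iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A caller would pass a lambda wrapping the code under test and choose an iteration count large enough that the whole run takes at least a few seconds, as suggested above.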
Eugene Sh. is correct that these two lines aren't doing the same thing - the first one copies the value of numbers[i] into a local variable, whereas the second one stores the address of numbers[i] into a pointer local variable. If you can do what you need using just the address of numbers[i] and referring back to numbers[i], it's likely that will be faster than doing a wholesale copy of the value, although it depends on a lot of factors like the size of the struct, etc.
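A small sketch of the difference between the two lines. The struct layout is hypothetical, since the question does not show it, and the function names are made up for this example. The copy is a snapshot, while the pointer keeps referring to the live array element.

```cpp
#include <cassert>

// Hypothetical layout; the question does not show the real struct.
struct some_struct_type {
    int id;
    double payload[8];
};

some_struct_type numbers[4] = {};

// Copies the whole element; later changes to numbers[i] are not
// visible through the returned copy.
some_struct_type read_by_copy(int i) {
    some_struct_type val = numbers[i];
    return val;
}

// Stores only the address; pointer-to-const documents that the array
// is not modified through this pointer.
const some_struct_type* read_by_pointer(int i) {
    const some_struct_type* val = &numbers[i];
    return val;
}
```

Which one is cheaper depends on the struct size and on how often the value is read afterwards, so this only shows the semantic difference, not a performance result.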
Regarding the general optimization question, here are some things to consider...
Use a Profiler
The best way to measure the speed of your code is to use a profiling tool. There are a number of different tools available, depending on your target platform - see (for example) How can I profile C++ code running in Linux? and What's the best free C++ profiler for Windows?.
You really want to use a profiler for this because it's notoriously difficult to tell just from looking what the costliest parts of a program will be, for a number of reasons...
# of Instructions != # of Processor Cycles
One reason to use a profiler is that it's often difficult to tell from looking at two pieces of code which one will run faster. Even in assembly code, you can't simply count the number of instructions, because many instructions take multiple processor cycles to complete. This varies considerably by target platform. For example, on some platforms the fastest way to load the value 1 to a CPU register is something straightforward like this:
MOV r0, #1
Whereas on other platforms the fastest approach is actually to clear the register and then increment it, like this:
CLR r0
INC r0
The second case has more instruction lines, but that doesn't necessarily mean that it's slower.
Other Complications
Another reason that it's difficult to tell which pieces of code will most need optimizing is that most modern computers employ fairly sophisticated caches that can dramatically improve performance. Executing a cached loop several times is often less expensive than loading a single piece of data from a location that isn't cached. It can be very difficult to predict exactly what will cause a cache miss, but when using a profiler you don't have to predict - it makes the measurements for you.
Avoid Premature Optimization
For most projects, optimizing your code is best left until relatively late in the process. If you start optimizing too early, you may find that you spend a lot of time optimizing a feature that turns out to be relatively inexpensive compared to your program's other features. That said, there are some notable counterexamples - if you're building a large-scale database tool you might reasonably expect that performance is going to be an important selling point.

What exactly is code branching

What is code branching? I've seen it mentioned in various places, especially with bit twiddling, but never really thought about it.
How does it slow a program down and what should I be thinking about while coding?
I see mention of if statements. I really don't understand how such code can slow down the code. If condition is true do following instructions, otherwise jump to another set of instructions? I see the other thread mentioning "branch prediction", maybe this is where I'm really lost. What is there to predict? The condition is right there and it can only be true or false.
I don't believe this to be a duplicate of this related question. The linked thread is talking about "Branch prediction" in reference to an unsorted array. I'm asking what is branching and why prediction is required.
The simplest example of a branch is an if statement:
if (condition)
doSomething();
Now if condition is true then doSomething() is executed. If not then the execution branches, by jumping to the statement that follows the end of the if.
In very simple machine pseudo code this might be compiled to something along these lines:
TEST condition
JZ label1 ; jump over the CALL if condition is 0
CALL doSomething
##label1
The branch point is the JZ instruction. The subsequent execution point depends on the outcome of the test of condition.
Branching affects performance because modern processors predict the outcome of branches and perform speculative execution, ahead of time. If the prediction turns out to be wrong then the speculative execution has to be unwound.
If you can arrange the code so that prediction success rates are higher, then performance is increased. That's because the speculatively executed code is now less of an overhead since it has already been executed before it was even needed. That this is possible is down to the fact that modern processors are highly parallel. Spare execution units can be put to use performing this speculative execution.
Now, there's one sort of code that never has branch prediction misses. And that is code with no branches. For branch free code, the results of speculative execution are always useful. So, all other things being equal, code without branches executes faster than code with branches.
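As an illustration of what "code without branches" can look like at the source level, here is a sketch with made-up function names. The first version uses an if; the second uses the comparison result (0 or 1) as a multiplier, so the loop body contains no data-dependent if. Whether the compiler actually emits a conditional jump for either version is up to the compiler, so this is an assumption to check in the generated assembly rather than a guarantee.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Branching version: the if may compile to a conditional jump.
std::int64_t sum_at_least_branchy(const std::vector<int>& values, int threshold) {
    std::int64_t sum = 0;
    for (int v : values) {
        if (v >= threshold) {
            sum += v;
        }
    }
    return sum;
}

// Branch-free version: (v >= threshold) converts to 0 or 1, so elements
// below the threshold contribute 0 instead of being skipped.
std::int64_t sum_at_least_branchless(const std::vector<int>& values, int threshold) {
    std::int64_t sum = 0;
    for (int v : values) {
        sum += static_cast<std::int64_t>(v) * static_cast<std::int64_t>(v >= threshold);
    }
    return sum;
}
```

The branch-free form always does the multiply and add, so it trades a possible misprediction for a small fixed amount of extra work per element.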
Essentially imagine an assembly line in a factory. Imagine that, as each item passes through the assembly line, it will go to employee 1, then employee 2, on up to employee 5. After employee 5 is done with it, the item is finished and is ready to be packaged. Thus all five employees can be working on different items at the same time and not having to just wait around on each other. Unlike most assembly lines though, every single time employee 1 starts working on a new item, it's potentially a new type of item - not just the same type over and over.
Well, for whatever weird and imaginative reason, imagine the manager is standing at the very end of the assembly line. And he has a list saying, "Make this item first. Then make that type of item. Then that type of item." And so on. As he sees employee 5 finish each item and move on to the next, the manager then tells employee 1 which type of item to start working on, looking at where they are in the list at that time.
Now let's say there's a point in that list - that "sequence of computer instructions" - where it says, "Now start making a coffee cup. If it's nighttime when you finish making the cup, then start making a frozen dinner. If it's daytime, then start making a bag of coffee grounds." This is your if statement. Since the manager, in this kind of fake example, doesn't really know what time of day it's going to be until he actually sees the cup after it's finished, he could just wait until that time to call out the next item to make - either a frozen dinner or some coffee grounds.
The problem there is that if he waits until the very last second like that - which he has to wait until to be absolutely sure what time of day it'll be when the cup is finished, and thus what the next item's going to be - then workers 1-4 are not going to be working on anything at all until worker 5 is finished. That completely defeats the purpose of an assembly line! So the manager takes a guess. The factory is open 7 hours in the day and only 1 hour at night. So it is much more likely that the cup will be finished in the daytime, thus warranting the coffee grounds.
So as soon as employee 2 starts working on the coffee cup, the manager calls out the coffee grounds to employee 1. Then the assembly line just keeps moving along like it had been, until employee 5 is finished with the cup. At that time the manager finally sees what time of day it is. If it's daytime, that's great! If it's nighttime, everything started on after that coffee cup must be thrown away, and the frozen dinner must be started on. ...So essentially branch prediction is where the manager temporarily ventures a guess like that, and the line moves along faster when he's right.
Pseudo-Edit:
It is largely hardware-related. The main search phrase would probably be "computer pipeline cpu". But the list of instructions is already made up - it's just that that list of instructions has branches within it; it's not always 1, 2, 3, etc. But as stage 5 of the pipeline is finishing up instruction 10, stage 1 can already be working on instruction 14. Usually computer instructions can be broken up like that and worked on in segments. If stages 1-n are all working on something at the same time, and nothing gets trashed later, that's just faster than finishing one before starting another.
Any jump in your code is a branch. This happens in if statements, function calls, and loops.
Modern CPUs have long pipelines. This means the CPU processes various parts of multiple instructions at the same time. The problem with branches is that the pipeline might not have started processing the correct instructions. In that case the speculative instructions need to be thrown out, and the processor has to start processing the correct instructions from scratch.
When a branch is encountered, the CPU tries to predict which branch is going to be used. This is called branch prediction.
Most of the optimizations for branch prediction will be done by your compiler so you do not really need to worry about branching.
This probably falls into the category of only worry about branch optimizations if you have profiled the code and can see that this is a problem.
A branch is a deviation from normal control flow. Processors will execute instructions sequentially, but in a branch, the program counter is moved to another place in memory (for example, a branch depending on a condition, or a procedure call).

Is it better to calculate x-1 into a variable and then use that?

I have a for loop that uses (x-1) inside it. Is it better to create a new variable y = x-1 and then use y inside the loop, rather than recalculating it many times in the for loop? I would save many subtractions. Is this a worthwhile optimization?
Don't underestimate or overestimate the capabilities of the compiler.
Profile first.
Look at the assembly language listing of the optimized version for the function.
The compiler may be able to combine the x-1 with another instruction.
Wait until the code works completely and is robust before making optimizations. Oftentimes, code is harder to debug when it is optimized, and you could be wasting your time optimizing code that isn't used frequently.
If x does not change inside the loop, then the compiler will most likely optimize it and calculate it only once, so it should not matter. (Of course, if x does change inside the loop, then it goes without saying that you should recompute it inside the loop).
Aside from the optimization aspect, it is probably more important to write the code so it makes the most sense to another programmer (e.g., someone maintaining the code). If using x-1 inside the loop makes the code clearer, it is almost certainly better to write it that way. Unless the loop is extremely critical to overall performance, it is (in my opinion) better to focus on making the code easier to read.
Yes, that will help speed up your code. Anything that decreases the number of calculations you need to perform will increase the speed of your code, especially if your loop goes through a lot of iterations.
No need to do something over and over again if you can do it just once :)
This is assuming that x doesn't change during your loop, though. If x does change, then you'll need to recalculate it because y will be different each time through the loop.
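The two forms under discussion can be written side by side, as in the sketch below. The function names and the loop body are invented for illustration, and x is assumed not to change inside the loop. Both return the same result; whether the generated code differs at all depends on the compiler and optimization level, which is why the answers above suggest looking at the assembly or profiling.

```cpp
#include <cassert>
#include <vector>

// Writes x - 1 inside the loop body; an optimizer may hoist it anyway.
long long scale_sum_inline(const std::vector<int>& values, int x) {
    long long total = 0;
    for (int v : values) {
        total += static_cast<long long>(v) * (x - 1);
    }
    return total;
}

// Computes x - 1 once before the loop and reuses it as y.
long long scale_sum_hoisted(const std::vector<int>& values, int x) {
    const int y = x - 1;
    long long total = 0;
    for (int v : values) {
        total += static_cast<long long>(v) * y;
    }
    return total;
}
```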

Performance of breaking apart one loop into two loops

Good Day,
Suppose that you have a simple for loop like below...
for(int i=0;i<10;i++)
{
//statement 1
//statement 2
}
Assume that statement 1 and statement 2 are O(1). Besides the small overhead of "starting" another loop, would breaking that for loop down into two (not nested, but sequential) loops be equally fast? For example...
for(int i=0;i<10;i++)
{
//statement 1
}
for(int i=0;i<10;i++)
{
//statement 2
}
The reason I ask such a silly question is that I have a Collision Detection System (CDS) that has to loop through all the objects. I want to "compartmentalize" the functionality of my CDS so I can simply call
cds.update(objectlist);
instead of having to break my CDS up. (Don't worry too much about my CDS implementation... I think I know what I am doing, I just don't know how to explain it. What I really need to know is whether I take a huge performance hit for looping through all my objects again.)
It depends on your application.
Possible Drawbacks (of splitting):
your data does not fit into the L1 data cache, therefore you load it once for the first loop and then reload it for the second loop
Possible Gains (of splitting):
your loop contains many variables, splitting helps reducing register/stack pressure and the optimizer turns it into better machine code
the functions you use thrash the L1 instruction cache, so the cache is reloaded on each iteration, while by splitting you manage to load it only once, at the first iteration of each loop
These lists are certainly not comprehensive, but already you can sense that there is a tension between code and data. So it is difficult for us to take an educated/a wild guess when we know neither.
When in doubt, profile. Use callgrind, check the cache misses in each case, check the number of instructions executed. Measure the time spent.
In terms of algorithmic complexity splitting the loops makes no difference.
In terms of real world performance splitting the loops could improve performance, worsen performance or make no difference - it depends on the OS, hardware and - of course - what statement 1 and statement 2 are.
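For reference, here is a sketch of the two shapes being compared. The Object struct and the two statements are hypothetical stand-ins, since the question does not show its collision code. Both functions leave the objects in the same state; the split version walks the container twice, which is where the cache effects mentioned above come in.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for whatever the collision system stores.
struct Object {
    float position;
    float velocity;
    bool active;
};

// One pass: both statements run for each object in turn.
void update_fused(std::vector<Object>& objects, float dt) {
    for (Object& o : objects) {
        o.position += o.velocity * dt;  // statement 1
        o.active = o.position >= 0.0f;  // statement 2
    }
}

// Two passes: statement 1 for every object, then statement 2 for every object.
void update_split(std::vector<Object>& objects, float dt) {
    for (Object& o : objects) {
        o.position += o.velocity * dt;  // statement 1
    }
    for (Object& o : objects) {
        o.active = o.position >= 0.0f;  // statement 2
    }
}
```

Splitting is only equivalent when statement 2 for one object does not depend on statement 1 having not yet run for a later object; that holds in this sketch but would need checking in real code.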
As noted, the complexity remains.
But in the real world, it is impossible for us to predict which version runs faster. The following are factors that play roles, huge ones:
Data caching
Instruction caching
Speculative execution
Branch prediction
Branch target buffers
Number of available registers on the CPU
Cache sizes
(note: over all of them hangs the sword of Damocles of misprediction; all are wikipedizable and googlable)
Especially the last factor makes it sometimes impossible to compile the one true code for code whose performance relies on specific cache sizes. Some applications will run faster on CPU with huge caches, while running slower on small caches, and for some other applications it will be the opposite.
Solutions:
Let your compiler do the job of loop transformation. Modern versions of g++ are quite good at that discipline. Another discipline that g++ is good at is automatic vectorization. Be aware that compilers know more about computer architecture than almost all people.
Ship different binaries and a dispatcher.
Use cache-oblivious data structures/layouts and algorithms that adapt to the target cache.
It is always a good idea to endeavor for software that adapts to the target, ideally without sacrificing code quality. And before doing manual optimization, either microscopic or macroscopic, measure real world runs, then and only then optimize.
Literature:
* Agner Fog's Guides
* Intel's Guides
With two loops you will be paying for:
increased generated code size
2x as many branch predictions
depending on what the data layout of statements 1 and 2 is, you could be reloading data into cache.
The last point could have a huge impact in either direction. You should measure as with any perf optimization.
As far as big-O complexity is concerned, this doesn't make a difference: if one loop is O(n), then so is the two-loop solution.
As far as micro-optimisation, it is hard to say. The cost of a loop is rather small, we don't know what the cost of accessing your objects is (if they are in a vector, then it should be rather small too), but there is a lot to consider to give a useful answer.
You're correct in noting that there will be some performance overhead from creating a second loop. Therefore, it cannot be "equally fast", because this overhead, while small, is still overhead.
I won't try to speak intelligently about how collision systems should be built, but if you're trying to optimize performance it's better to avoid building unnecessary control structures if you can manage it without pulling your hair out.
Remember that premature optimization is one of the worst things you can do. Worry about optimization when you have a performance problem, in my opinion.

if, switch and function pointers speed comparison

I'm building a small interpreter so I wanted to test how fast ifs, switch and pointers to functions are, compared to each other. if with 19 else ifs is slightly faster than switch with 20 cases, and function pointers (an array of 20 function pointers) is way slower than the previous two...
I expected the results to be completely opposite, can anyone please explain?
On a modern processor, a lot of this comes down to branch prediction. While a switch statement can be implemented as a jump table that takes about the same length of time to execute any branch of the code, it's also generally pretty unpredictable -- literally; a branch predictor will often do a fairly poor job of predicting which branch gets taken, which means there's a very good chance of a pipeline bubble (typically around 15 wasted cycles or so).
An if-statement may do more comparisons, but most of the branches are probably taken the same way nearly every time, so the branch predictor can predict their results much more accurately.
Pointers to functions can also be fairly unpredictable. Worse, until fairly recently, most processors pretty much didn't even try. Only fairly recently did they add enough to most BTB (Branch Target Buffer) implementations that they can really even make a serious attempt at predicting the target of a branch via a pointer. On older processors, pointers to functions often do quite poorly in speed comparisons.
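For readers who want to reproduce the comparison, the three dispatch styles might be sketched as below. The four-opcode instruction set and all names are hypothetical; the question used about 20 cases and did not show its code. All three functions compute the same result, so any timing difference comes from how the dispatch is compiled and predicted.

```cpp
#include <cassert>

// Hypothetical four-opcode instruction set, for illustration only.
enum Op { OP_ADD = 0, OP_SUB = 1, OP_MUL = 2, OP_NEG = 3, OP_COUNT = 4 };

// if / else-if chain: a series of conditional branches.
int run_if_chain(Op op, int a, int b) {
    if (op == OP_ADD) return a + b;
    else if (op == OP_SUB) return a - b;
    else if (op == OP_MUL) return a * b;
    else return -a;
}

// switch: the compiler may turn this into a jump table (an indirect jump).
int run_switch(Op op, int a, int b) {
    switch (op) {
        case OP_ADD: return a + b;
        case OP_SUB: return a - b;
        case OP_MUL: return a * b;
        default:     return -a;
    }
}

int do_add(int a, int b) { return a + b; }
int do_sub(int a, int b) { return a - b; }
int do_mul(int a, int b) { return a * b; }
int do_neg(int a, int)   { return -a; }

// Table of function pointers: every dispatch is an indirect call.
using Handler = int (*)(int, int);
const Handler handlers[OP_COUNT] = { do_add, do_sub, do_mul, do_neg };

int run_table(Op op, int a, int b) {
    assert(op >= OP_ADD && op < OP_COUNT);
    return handlers[op](a, b);
}
```

With an opcode stream where one opcode dominates, the if chain's branches tend to go the same way each time, which matches the explanation above; a stream of evenly mixed opcodes would be the less predictable case for all three.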