C & C++ compilers are allowed to reorder operations as long as the as-if rule holds. What is an example of such a reordering performed by a compiler, and what is the potential performance gain to be had by doing it?
Examples involving any (C/C++) compiler on any platform are welcome.
Suppose you have the following operations being performed:
int i=0,j=0;
i++;
i++;
i++;
j++;
j++;
j++;
Ignoring for the moment that the three increments would likely be folded by the compiler into a single += 3, you will get higher processor-pipeline throughput if you reorder the operations as
i++;
j++;
i++;
j++;
i++;
j++;
since j++ doesn't have to wait for the result of i++, while in the previous case most of the instructions had a data dependency on the previous instruction. In more complicated computations, where there isn't an easy way to reduce the number of instructions to be performed, the compiler can still look at data dependencies and reorder instructions so that an instruction depending on the result of an earlier instruction is as far away from it as possible.
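The same idea can be written out by hand, which makes the dependency chains visible. The following is only a sketch: the function names and the use of unsigned arithmetic are assumptions made for the example, and an optimizing compiler may already perform this transformation on the first version by itself.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every addition depends on the result of the previous one.
unsigned sum_single_chain(const std::vector<unsigned>& v)
{
    unsigned total = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}

// Two accumulators: the additions into even_sum and odd_sum are independent
// of each other, so a pipelined CPU can overlap them.
unsigned sum_two_chains(const std::vector<unsigned>& v)
{
    unsigned even_sum = 0, odd_sum = 0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) {
        even_sum += v[i];
        odd_sum  += v[i + 1];
    }
    if (i < v.size())          // odd-length input: one element left over
        even_sum += v[i];
    return even_sum + odd_sum;
}
```

Both functions return the same value for any input, which is what the as-if rule requires; whether the second one is measurably faster depends on the compiler and the CPU.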
Another example of such an optimization is when you are dealing with pure functions. Looking at a simple example again, assume you have a pure function f(int x) which you are summing over a loop.
int tot = 0;
int x;//something known only at runtime
for(int i = 0; i < 100; i++)
tot += f(x);
Since f is a pure function, the compiler can reorder calls to it as it pleases. In particular, it can transform this loop to
int tot = 0;
int x;//something known only at runtime
int fval = f(x);
for(int i = 0; i < 100; i++)
tot += fval;
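Written as two complete functions, the transformation looks roughly like this. The body of f is a made-up placeholder (any function whose result depends only on its argument and that has no side effects would do), and total_naive and total_hoisted are names invented for the sketch.

```cpp
// Hypothetical pure function: the result depends only on x, no side effects.
int f(int x)
{
    return x * x + 1;
}

// The loop as written: f(x) is evaluated on every iteration.
int total_naive(int x)
{
    int tot = 0;
    for (int i = 0; i < 100; i++)
        tot += f(x);
    return tot;
}

// The hoisted form: f(x) is evaluated once, before the loop.
int total_hoisted(int x)
{
    int tot = 0;
    const int fval = f(x);
    for (int i = 0; i < 100; i++)
        tot += fval;
    return tot;
}
```

The compiler may only do this on its own when it can prove that f is pure, for example because the definition of f is visible in the same translation unit.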
I'm sure there are quite a few examples where reordering operations will yield faster performance. An obvious example would be to reorder loads as early as possible, since these are typically much slower than other CPU operations. By doing other, unrelated work whilst the memory is being fetched, the CPU can save time overall.
That is, given something like this:
expensive_calculation();
x = load();
do_something(x);
We can reorder it like this:
x = load();
expensive_calculation();
do_something(x);
So while we're waiting for the load to complete, we can essentially do expensive_calculation() for free.
Suppose you have a loop like:
for (i=0; i<n; i++) dest[i] = src[i];
Think memcpy. You might want the compiler to be able to vectorize this, i.e. load 8 or 16 bytes at a time and then store 8 or 16 at a time. Making that transformation is a reordering, since it would cause src[1] to be read before dest[0] is stored. Moreover, unless the compiler knows that src and dest don't overlap, it's an invalid transformation, i.e. one the compiler is not allowed to make. Use of the restrict keyword (C99 and later) allows you to tell the compiler that they don't overlap so that this kind of (extremely valuable) optimization is possible.
The same sort of thing arises all the time in operations on arrays that aren't just copying - things like vector/matrix operations, transformations of sound/image sample data, etc.
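A minimal sketch of the copy loop with the no-overlap promise spelled out. Note the assumptions: standard C++ has no restrict keyword, so this uses the __restrict spelling that GCC, Clang and MSVC accept as an extension (in C99 you would write restrict), and copy_no_overlap is a name invented for the example.

```cpp
#include <cstddef>

// __restrict is a compiler extension in C++; C99 spells it `restrict`.
// It is a promise by the caller that dest and src do not overlap, which
// allows the compiler to load several elements before storing any.
void copy_no_overlap(unsigned char* __restrict dest,
                     const unsigned char* __restrict src,
                     std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        dest[i] = src[i];
}
```

Calling this with overlapping buffers breaks the promise and is undefined behaviour, so the qualifier should only be added when the caller can really guarantee it.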
My prof once said that if-statements are rather slow and should be avoided as much as possible. I'm making a game in OpenGL, where I need a lot of them.
In my tests replacing an if-statement with AND via short-circuiting worked, but is it faster?
#include <cstdlib>
#include <iostream>
bool doSomething();
int main()
{
int randomNumber = std::rand() % 10;
randomNumber == 5 && doSomething();
return 0;
}
bool doSomething()
{
std::cout << "function executed" << std::endl;
return true;
}
My intention is to use this inside the draw function of my renderer. My models are supposed to have flags, if a flag is true, a certain function should execute.
if-statements are rather slow and should be avoided as much as possible.
This is wrong and/or misleading. Most simplified statements about slowness of a program are wrong. There's probably something wrong with this answer too.
C++ statements don't have a speed that can be attributed to them. It's the speed of the compiled program that matters. And that consists of assembly language instructions; not of C++ statements.
What would probably be more correct is to say that branch instructions can be relatively slow (on modern, superscalar CPU architectures) (when the branch cannot be predicted well) (depending on what you are comparing to; there are many things that are much more expensive).
randomNumber == 5 && doSomething();
An if-statement is often compiled into a program that uses a branch instruction. A short-circuiting logical-and operation is also often compiled into a program that uses a branch instruction. Replacing if-statement with a logical-and operator is not a magic bullet that makes the program faster.
If you were to compare the program produced by the logical-and and the corresponding program where it is replaced with if (randomNumber == 5), you would find that the optimiser sees through your trick and produces the same assembly in both cases.
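To check this yourself, the two variants can be put side by side and compiled with optimization enabled (for example with -O2 -S on GCC or Clang) so the generated assembly can be compared. The sketch below is an assumption-laden stand-in for the question's code: with_if, with_and and the calls counter are names made up for the example.

```cpp
static int calls = 0;   // counts how often doSomething() ran (illustration only)

bool doSomething()
{
    ++calls;
    return true;
}

// Version with an if-statement.
void with_if(int randomNumber)
{
    if (randomNumber == 5)
        doSomething();
}

// Version with short-circuit &&; the value of the expression is discarded.
void with_and(int randomNumber)
{
    (void)(randomNumber == 5 && doSomething());
}
```

Both functions call doSomething() under exactly the same condition, so an optimizer has no reason to treat them differently.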
My models are supposed to have flags, if a flag is true, a certain function should execute.
In order to avoid the branch, you must change the premise. Instead of iterating through a sequence of all models, checking flag, and conditionally calling a function, you could create a sequence of all models for which the function should be called, iterate that, and call the function unconditionally -> no branching. Is this alternative faster? There is certainly some overhead of maintaining the data structure and the branch predictor may have made this unnecessary. Only way to know for sure is to measure the program.
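One possible shape for that alternative is sketched below. Everything in it is hypothetical (the Model struct, specialDraw, and the three function names are invented for the example), and it assumes the list of flagged models is rebuilt only when a flag changes rather than on every frame.

```cpp
#include <cstddef>
#include <vector>

struct Model {          // hypothetical stand-in for a renderer model
    bool flag;
    int  drawCount;
};

void specialDraw(Model& m) { ++m.drawCount; }   // hypothetical per-model function

// Branch inside the hot loop: the flag is tested for every model.
void drawChecked(std::vector<Model>& models)
{
    for (Model& m : models)
        if (m.flag)
            specialDraw(m);
}

// Collect the indices of the flagged models once...
std::vector<std::size_t> collectFlagged(const std::vector<Model>& models)
{
    std::vector<std::size_t> flagged;
    for (std::size_t i = 0; i < models.size(); ++i)
        if (models[i].flag)
            flagged.push_back(i);
    return flagged;
}

// ...then the hot loop calls the function unconditionally.
void drawFlagged(std::vector<Model>& models, const std::vector<std::size_t>& flagged)
{
    for (std::size_t i : flagged)
        specialDraw(models[i]);
}
```

The flag test has not disappeared, it has moved into collectFlagged, so this only pays off if the list is reused across many draw calls.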
I agree with the comments above that in almost all practical cases, it's OK to use ifs as much as you need without hesitation.
I also agree that this is not something a beginner should spend energy on optimizing, and that using logical operators will likely emit code similar to ifs.
However - there is a valid issue here related to branching in general, so those who are interested are welcome to read on.
Modern CPUs use what we call Instruction pipelining.
Without getting too deep into the technical details:
Within each CPU core there is a level of parallelism.
Each assembly instruction is composed of several stages, and while the current instruction is executed, the next instructions are prepared to a certain degree.
This is called instruction pipelining.
This concept is broken with any kind of branching in general, and conditionals (ifs) in particular.
It's true that there is a mechanism of branch prediction, but it works only to some extent.
So although in most cases ifs are totally OK, there are cases where this should be taken into account.
As always when it comes to optimizations, one should carefully profile.
Take the following piece of code as an example (similar things are common in image processing and other implementations):
unsigned char * pData = ...; // get data from somewhere
int dataSize = 100000000; // something big
bool cond = ...; // initialize some condition relevant for all data
for (int i = 0; i < dataSize; ++i, ++pData)
{
if (cond)
{
*pData = 2; // imagine some small calculation
}
else
{
*pData = 3; // imagine some other small calculation
}
}
It might be better to do it like this (even though it contains duplication, which is evil from a software engineering point of view):
if (cond)
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 2; // imagine some small calculation
}
}
else
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 3; // imagine some other small calculation
}
}
We still have an if, but it potentially causes a branch only once.
In certain [rare] cases (requires profiling as mentioned above) it will be more efficient to do even something like this:
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = (2 * cond + 3 * (!cond));
}
I know it's not common, but some years ago I encountered specific hardware on which the cost of 2 multiplications and 1 addition with negation was less than the cost of branching (due to the reset of the instruction pipeline). This "trick" also supports using different condition values for different parts of the data.
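The per-element variant of the trick could look roughly like this. The function name and the use of a second vector to hold the conditions are assumptions for the sketch, and a compiler is free to turn this into a conditional move or even back into a branch.

```cpp
#include <cstddef>
#include <vector>

// Writes 2 where conds[i] is non-zero and 3 where it is zero, without an if.
// A bool converts to 0 or 1, so exactly one of the two products is non-zero.
void fill_branchless(std::vector<unsigned char>& data,
                     const std::vector<unsigned char>& conds)
{
    for (std::size_t i = 0; i < data.size() && i < conds.size(); ++i) {
        const bool cond = conds[i] != 0;
        data[i] = static_cast<unsigned char>(2 * cond + 3 * !cond);
    }
}
```

As with the other variants, only profiling on the target hardware can tell whether this beats the plain if/else.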
Bottom line: ifs are usually OK, but it's good to be aware that sometimes there is a cost.
I have 2 pieces of code, which do the exact same thing, but one does not actually work. Can anyone explain why?
The code is sending data via SPI to an FPGA running the display. I'm almost out of code storage on the chip, so I was trying to cut down as much as I could. The change below ended up breaking for some reason; the rest of the program is exactly the same.
//Looping to execute code twice doesn't work
for (byte i = 0; i < 3; i++)
{
temp2 = temp % 10;
temp /= 10;
temp2 |= 0x40;
for (byte k = 0; k < 2; k++)
{
SPI.transfer(reg[j]);
delayMicroseconds(10);
SPI.transfer(temp2);
delayMicroseconds(10);
}
reg[j] -= 1;
}
.
//But copy-paste does
for (int i = 0; i < 3; i++)
{
temp2 = temp % 10;
temp /= 10;
temp2 |= 0x40;
SPI.transfer(reg[j]);
delayMicroseconds(10);
SPI.transfer(temp2);
delayMicroseconds(10);
SPI.transfer(reg[j]);
delayMicroseconds(10);
SPI.transfer(temp2);
delayMicroseconds(10);
reg[j] -= 1;
}
The most likely explanation is that some other code is relying on this loop meeting specific timing constraints, and failing if it doesn't.
The changes you have introduced that potentially affect timing include:
Changing i to be of type byte rather than int. This can affect timing - int is normally the "native" type, for which operations are more efficient by various measures. It is possible that using a byte changes the timing of the outer loop. For example, if byte is a smaller type than an int, operations on a byte may involve conversion to and from int.
Essentially you've replaced a repeated sequence of statements with an inner loop. Depending on optimisation settings, the compiler might unroll the loop (producing, in effect, the same behaviour as the copy-pasted code sample), or it may not. If the compiler does not unroll the loop, the overhead of the loop construct itself (initialising the variable k, checking and incrementing it on each iteration, etc) can affect timing of the code.
If there is some reliance on this code meeting a specific timing constraint, this needs to be documented somewhere. If the requirement is not documented, I would suggest you document it (e.g. enter it as a specific requirement) and then document derived requirements (e.g. one requirement to use int to control the outer loop [if that is needed] and another requirement that the inner loop be unrolled in code rather than relying on compiler optimisation).
... breaking for some reason ...
is never nice, and if something "breaks", it can have different effects on different sketches.
reg[j] -= 1;
If e.g. this hurts, because j is out of bounds, it might have different effects whether there is a variable k or not.
Try to isolate the issue and make it reproducible ...
I bet the problem is not in the posted part. ;)
In your original post: "... I'm almost out of code storage on the chip ...".
How are you doing with your RAM?
The one time I have seen something like that in my own code was when I simply ran out of RAM and my variables were corrupting the stack or the stack was corrupting my variables.
Remember that C/C++ needs some 'working' memory to keep track of everything.
Your variables are assigned from one end of your available RAM and the stack from the other end. When you use too much RAM or your functions are nested too deeply, the two collide and interfere/corrupt each other.
|************ Available RAM ************|
| Variables >>>>>>>>>>>>>>>>!!<<< Stack |
Variables grow >
Stack grow <
Problem !!
I give the following example to illustrate my question:
void fun(int i, float *pt)
{
// do something based on i
std::cout<<*(pt+i)<<std::endl;
}
const unsigned int LOOP = 2000000007;
void fun_without_optimization()
{
float *example;
example = new float [LOOP];
for(unsigned int i=0; i<LOOP; i++)
{
fun(i,example);
}
delete []example;
}
void fun_with_optimization()
{
float *example;
example = new float [LOOP];
unsigned int unit_loop = LOOP/10;
unsigned int left_loop = LOOP%10;
float *pt = example;
for(unsigned int i=0; i<unit_loop; i++)
{
fun(0,pt);
fun(1,pt);
fun(2,pt);
fun(3,pt);
fun(4,pt);
fun(5,pt);
fun(6,pt);
fun(7,pt);
fun(8,pt);
fun(9,pt);
pt=pt+10;
}
delete []example;
}
As far as I understand, function fun_without_optimization() and function fun_with_optimization() should perform the same. The only argument why the second function is better than the first is that the pointer calculation in fun becomes simple. Any other arguments why the second function is better?
Unrolling a loop in which I/O is performed is like moving the landing strip for a B747 from London an inch eastward in JFK.
Re: "Any other arguments why the second function is better?" - would you accept the answer explaining why it is NOT better?
Manually unrolling a loop is error-prone, as is clearly illustrated by your code: you forgot to process the tail left_loop.
Compilers have been doing this optimization for you for at least a couple of decades.
How do you know the optimal number of iterations to put in that unrolled loop? Do you target a specific cache size and calculate the length of assembly instructions in bytes? The compiler might.
Your messing with the otherwise clean loop can prevent other optimizations, like the use of SIMD.
The bottom line is: if you know something that your compiler doesn't (specific pattern of the run-time data, details of the targeted execution environment, etc.), and you know what you are doing - you can try manual loop unrolling. But even then - profile.
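For completeness, this is roughly what a manual unroll looks like when the tail is handled. It is a sketch under assumptions: process stands in for the question's fun(), the unroll factor of 4 is arbitrary, and a std::vector replaces the raw array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-element work, standing in for fun() in the question.
void process(float& value) { value *= 2.0f; }

// Manually unrolled by 4, including the tail that the question's version skips.
void process_all_unrolled(std::vector<float>& data)
{
    const std::size_t n = data.size();
    const std::size_t unrolled_end = n - (n % 4);   // largest multiple of 4 <= n

    std::size_t i = 0;
    for (; i < unrolled_end; i += 4) {
        process(data[i]);
        process(data[i + 1]);
        process(data[i + 2]);
        process(data[i + 3]);
    }
    for (; i < n; ++i)          // tail: the remaining n % 4 elements
        process(data[i]);
}
```

The extra bookkeeping for the tail is exactly the kind of detail that is easy to get wrong by hand and that the compiler handles automatically when it unrolls.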
The technique you describe is called loop unrolling; potentially this increases performance, as the time for evaluation of the control structures (update of the loop variable and checking the termination condition) becomes smaller. However, decent compilers can do this for you and maintainability of the code decreases if done manually.
This is an optimization technique used for parallel architectures (architectures that support VLIW instructions). Depending on the number of DALU (most commonly 4) and ALU (most commonly 2) units the architecture supports, and the level of "parallelization" the code supports, multiple instructions can be executed in one cycle.
So this code:
for (int i=0; i<n;i++) //n multiple of 4, for simplicity
a+=temp; //just a random instruction
Will actually execute faster on a parallel architecture if rewritten like:
for (int i=0;i<n ;i+=4)
{
temp0 = temp + temp; //these two additions are independent and can be executed in parallel
temp1 = temp + temp;
a=temp0+temp1+a; //adds 4*temp per iteration, like four passes of the original loop
}
There is a limit to how much you can parallelize your code, a limit imposed by the physical ALUs/DALUs the CPU has. That's why it's important to know your architecture before you attempt to (properly) optimize your code.
It does not stop here: for maximum efficiency, the code you want to optimize has to be a continuous block of code, meaning no jumps (no function calls, no change-of-flow instructions).
Writing your code, like:
for(unsigned int i=0; i<unit_loop; i++)
{
fun(0,pt);
fun(1,pt);
fun(2,pt);
fun(3,pt);
fun(4,pt);
fun(5,pt);
fun(6,pt);
fun(7,pt);
fun(8,pt);
fun(9,pt);
pt=pt+10;
}
Would not do much, unless the compiler inlines the function calls; and it looks like too many instructions anyway...
On a different note: while it's true that you ALWAYS have to work with the compiler when optimizing your code, you should NEVER rely on it alone when you want to get the maximum optimization out of your code. Remember, the compiler handles 'the general case' while you are likely interested in a particular situation - that's why some compilers have special directives to help with the optimization process.
A lot of times I see code like:
int s = a / x;
for (int i = 0; i < s; i++)
// do something
If inside the for loop, neither a nor x is modified, can I then simply write:
for (int i = 0; i < a / x; i++)
// do something
and then assume that the compiler optimizes a/x, i.e replaces it with a constant?
The most important part of int s = a / x is the variable name. It gives your syntax semantics, and lets you remember 12 months later why you were dividing one thing by another. You can't name the expression in the for statement, so you lose that self-documenting nature.
const auto monthlyAmount = (int)yearlyAmount / numberOfMonths;
for (auto i = 0; i < monthlyAmount; ++i)
// do something
In this way, extracting the variable isn't for a compiler optimization, it's a human maintainability optimization.
If the compiler can be sure that the variables used in the expression in the middle of your for loop will not change between iterations, it can optimize the calculation to be performed once at the beginning of the loop, instead of every iteration.
However, consider that the variables used are global variables, or references to variables external to the function, and in your for loop you call a function. The function could change these variables. If the compiler is able to see enough of the code at that point, it could find out if this is the case to decide whether to optimize. However, compilers are only willing to look so far (otherwise things would take even longer to compile), so in general you cannot assume the optimization is performed.
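A small sketch of that situation, where hoisting the bound would change the behaviour of the program. The globals a and x follow the question; body, count_recomputed and count_hoisted are names invented for the example.

```cpp
int a = 100;   // globals, as in the scenario described above
int x = 10;

void body() { a -= 10; }   // hypothetical loop body that modifies a global

// Bound re-evaluated before every iteration, as the source code says.
int count_recomputed()
{
    int iterations = 0;
    for (int i = 0; i < a / x; i++) {
        body();
        ++iterations;
    }
    return iterations;
}

// Bound computed once up front: a different program when body() changes a.
int count_hoisted()
{
    int iterations = 0;
    const int s = a / x;
    for (int i = 0; i < s; i++) {
        body();
        ++iterations;
    }
    return iterations;
}
```

Starting from a = 100 and x = 10, the first version stops after 5 iterations and the second runs 10, so the compiler can only hoist a / x if it can prove that the loop body leaves a and x alone.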
The concern for optimization probably stems from the fact that the condition is evaluated before each iteration. If this is a potentially expensive operation and you don't need to do it over and over again, you can extract it out of the loop:
const std::size_t size = s.size(); // just an example
for (std::size_t i = 0; i < size; ++i)
{
}
For inexpensive operations this is probably a premature optimization and the compiler might generate the same code. The only way to be sure is to check the generated assembly code.
The problem with such questions is that they cannot be generalized. Which optimizations the compiler will and will not perform can only be determined by case-by-case analysis.
I'd certainly expect the compiler to do this if one of the following holds true:
1) Both a and x are local variables whose addresses are never taken.
2) The code in the loop is completely inlined.
In practice the last requirement isn't as hard as it looks, because if the functions in the body cannot be inlined, their runtime will likely dwarf the time to re-compute the bound anyway.
I noticed that Google's C++ style guide cautions against inlining functions with loops or switch statements:
Another useful rule of thumb: it's typically not cost effective to
inline functions with loops or switch statements (unless, in the
common case, the loop or switch statement is never executed).
Other comments on StackOverflow have reiterated this sentiment.
Why are functions with loops or switch statements (or gotos) not suitable for or compatible with inlining? Does this apply to functions that contain any type of jump? Does it apply to functions with if statements? Also (and this might be somewhat unrelated), why is inlining functions that return a value discouraged?
I am particularly interested in this question because I am working with a segment of performance-sensitive code. I noticed that after inlining a function that contains a series of if statements, performance degrades pretty significantly. I'm using GNU Make 3.81, if that's relevant.
Inlining functions with conditional branches makes it more difficult for the CPU to accurately predict the branch statements, since each instance of the branch is independent.
If there are several branch statements, successful branch prediction saves a lot more cycles than the cost of calling the function.
Similar logic applies to unrolling loops with switch statements.
The Google guide referenced doesn't mention anything about functions returning values, so I'm assuming that reference is elsewhere, and requires a different question with an explicit citation.
While in your case, the performance degradation seems to be caused by branch mispredictions, I don't think that's the reason why the Google style guide advocates against inline functions containing loops or switch statements. There are use cases where the branch predictor can benefit from inlining.
A loop is often executed hundreds of times, so the execution time of the loop is much larger than the time saved by inlining. So the performance benefit is negligible (see Amdahl's law). OTOH, inlining functions results in an increase in code size, which has negative effects on the instruction cache.
In the case of switch statements, I can only guess. The rationale might be that jump tables can be rather large, wasting much more memory in the code segment than is obvious.
I think the keyword here is cost effective. Functions that cost a lot of cycles or memory are typically not worth inlining.
The purpose of a coding style guide is to tell you that if you are reading it you are unlikely to have added an optimisation to a real compiler, even less likely to have added a useful optimisation (measured by other people on realistic programs over a range of CPUs), therefore quite unlikely to be able to out-guess the guys who did. At least, do not mislead them, for example, by putting the volatile keyword in front of all your variables.
Inlining decisions in a compiler have very little to do with 'Making a Simple Branch Predictor Happy'. Or less confused.
First off, the target CPU may not even have branch prediction.
Second, a concrete example:
Imagine a compiler which has no other optimisation (turned on) except inlining. Then the only positive effect of inlining a function is that the bookkeeping related to function calls (saving registers, setting up locals, saving the return address, and jumping to and back) is eliminated. The cost is duplicating code at every single location where the function is called.
In a real compiler dozens of other simple optimisations are done and the hope of inlining decisions is that those optimisations will interact (or cascade) nicely. Here is a very simple example:
int f(int s)
{
...;
switch (s) {
case 1: ...; break;
case 2: ...; break;
case 42: ...; return ...;
}
return ...;
}
void g(...)
{
int x=f(42);
...
}
When the compiler decides to inline f, it replaces the RHS of the assignment with the body of f. It substitutes the actual parameter 42 for the formal parameter s and suddenly it finds that the switch is on a constant value...so it drops all the other branches and hopefully the known value will allow further simplifications (ie they cascade).
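The cascade can be made observable with a filled-in version of the example. This is only a sketch: the case bodies are made up, and constexpr is used here purely so that the folding can be checked with static_assert (a switch inside a constexpr function needs C++14 or later). An optimizer does comparable folding after inlining an ordinary function, but is not required to.

```cpp
// Hypothetical concrete version of f with invented return values.
constexpr int f(int s)
{
    switch (s) {
    case 1:  return 10;
    case 2:  return 20;
    case 42: return 4200;
    }
    return -1;
}

int g()
{
    int x = f(42);   // after inlining, the switch is on the constant 42
    return x;
}

// The call can be evaluated entirely at compile time: only case 42 survives.
static_assert(f(42) == 4200, "switch on a constant folds to a single case");
```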
If you are really lucky all calls to the function will be inlined (and unless f is visible outside) the original f will completely disappear from your code. So your compiler eliminated all the bookkeeping and made your code smaller at compile time. And made the code more local at runtime.
If you are unlucky, the code size grows, locality at runtime decreases and your code runs slower.
It is trickier to give a nice example when it is beneficial to inline loops because one has to assume other optimisations and the interactions between them.
The point is that it is hellishly difficult to predict what happens to a chunk of code even if you know all the ways the compiler is allowed to change it. I can't remember who said it but one should not be able to recognise the executable code produced by an optimising compiler.
I think it might be worth extending the example provided by @user1666959. I'm posting this as an answer to provide cleaner example code.
Let's consider the following scenario.
/// Counts odd numbers in range [0;number]
size_t countOdd(size_t number)
{
size_t result = 0;
for (size_t i = 0; i <= number; ++i)
{
result += (i % 2);
}
return result;
}
int main()
{
return countOdd(5);
}
If the function is not inlined and has external linkage, the whole loop is executed at run time. Imagine what happens when you inline it.
int main()
{
size_t result = 0;
for (size_t i = 0; i <= 5; ++i)
{
result += (i % 2);
}
return result;
}
Now let's enable the loop unrolling (unfolding) optimization. Here we know that the loop iterates from 0 to 5, so it can easily be unrolled, removing the loop conditions from the code.
int main()
{
size_t result = 0;
// iteration 0
size_t i = 0;
result += (i % 2);
// iteration 1
++i;
result += (i % 2);
// iteration 2
++i;
result += (i % 2);
// iteration 3
++i;
result += (i % 2);
// iteration 4
++i;
result += (i % 2);
// iteration 5
++i;
result += (i % 2);
return result;
}
No conditions, so it is faster already, but that's not all. We know the value of i, so why not pass it directly?
int main()
{
size_t result = 0;
// iteration 0
result += (0 % 2);
// iteration 1
result += (1 % 2);
// iteration 2
result += (2 % 2);
// iteration 3
result += (3 % 2);
// iteration 4
result += (4 % 2);
// iteration 5
result += (5 % 2);
return result;
}
Even simpler, but wait: those operations are constant expressions, so we can calculate them during compilation.
int main()
{
size_t result = 0;
// iteration 0
result += 0;
// iteration 1
result += 1;
// iteration 2
result += 0;
// iteration 3
result += 1;
// iteration 4
result += 0;
// iteration 5
result += 1;
return result;
}
So now the compiler sees that some of those operations have no effect and keeps only those that change the value. After that it removes the unnecessary temporary variables and performs as many calculations as it can during compilation, so your code ends up as:
int main()
{
return 3;
}
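If you want the compiler to confirm the end result rather than trust the walkthrough, one option (an assumption of this sketch, not something the optimizer needs) is to mark the function constexpr and check it with static_assert; loops in constexpr functions require C++14 or later.

```cpp
#include <cstddef>

// Same function as above, marked constexpr so it can run at compile time.
constexpr std::size_t countOdd(std::size_t number)
{
    std::size_t result = 0;
    for (std::size_t i = 0; i <= number; ++i)
        result += (i % 2);
    return result;
}

// Evaluated by the compiler: the odd numbers in [0;5] are 1, 3 and 5.
static_assert(countOdd(5) == 3, "countOdd(5) folds to 3 at compile time");
```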