If I have a loop that I know needs to be executed n times, is there a way to write a while (or for) loop without a comparison each time? If not, is there a way to make a macro turn:
int i = 0;
for (i = 0; i < 5; i++) {
    operation();
}
into:
operation();
operation();
operation();
operation();
operation();
P.S. This is the fastest loop I've come up with so far.
int i = 5;
while (i-- > 0) {
    operation();
}
A Sufficiently Smart Compiler will do this for you. More specifically, optimizing compilers understand loop unrolling. It's a fairly basic optimization, especially in cases like your example where the number of iterations is known at compile time.
So in short: turn on compiler optimizations and don't worry about it.
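If you want to see this happen, a minimal test case is enough. A sketch (operation is assumed to be defined in another translation unit, so the compiler cannot optimize the calls away):

// unroll_test.cpp -- build with e.g. g++ -O2 -S unroll_test.cpp
// and look at unroll_test.s for five back-to-back call instructions.
void operation();   // defined elsewhere

void run()
{
    for (int i = 0; i < 5; ++i) {
        operation();
    }
}

Depending on the compiler and optimization level, you may need -O3 (or -funroll-loops) before the loop is fully unrolled into five consecutive calls.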
The number of statements you write in the source code is not strictly related to the number of machine instructions the compiler will generate.
Most compilers are smart enough that, for your second example, they can generate code like:
operation();
operation();
operation();
operation();
operation();
automatically because they detect that the loop will always iterate 5 times.
Also, if you use profile-guided optimization and the compiler sees that a loop has a tiny body and a very high repeat count, it may unroll it even for a generic number of iterations, with code like:
while (count >= 5) {
    operation();
    operation();
    operation();
    operation();
    operation();
    count -= 5;
}
while (count > 0) {
    operation();
    count--;
}
For large counts this performs about one fifth as many tests as the naive version.
Whether this is worth doing is something that only profiling can tell.
One thing you can do if you know for sure that the code needs to be executed at least once is to write
do {
    operation();
} while (--count);
instead of
while (count--) {
    operation();
}
The possibility that count == 0 is somewhat annoying for CPUs because, in the code generated by most compilers, it requires an extra forward JMP:
    jmp test
loop:
    ...operation...
test:
    ...do the test...
    jne loop
The machine code for the do { ... } while version instead is simply:
loop:
    ...operation...
    ...do the test...
    jne loop
Both loops will do comparisons anyway.
Anyhow, the compiler should identify the constant iteration count and unroll the loop.
You could check that with gcc and the optimization flags (-O) and look at the generated code afterwards.
More important:
Don't optimize unless there is a significant reason to do so!
Once the C code is compiled, the while and for loops are converted to comparison statements in machine language, so there is no way to avoid some type of comparison with the for/while loops. You could make a series of goto and arithmetic statements that avoid using a comparison, but the result would probably be less efficient. You should look into how these loops are compiled into machine language using radare2 or gdb to see how they might be improved there.
With templates, you may unroll the loop (if the count is known at compile time) with something like:
#include <cstddef>           // std::size_t
#include <initializer_list>
#include <utility>           // std::index_sequence, std::make_index_sequence

void operation();            // the function to repeat

namespace detail
{
    template <std::size_t... Is>
    void do_operation(std::index_sequence<Is...>)
    {
        // The pack expansion calls operation() once per index; the
        // initializer_list guarantees left-to-right evaluation order.
        std::initializer_list<std::size_t>{(static_cast<void>(operation()), Is)...};
    }
}

template <std::size_t N>
void do_operation()
{
    detail::do_operation(std::make_index_sequence<N>());
}
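Usage is then simply do_operation<5>(); (this requires C++14 for std::make_index_sequence). As an aside, in C++17 the same unrolling can be written more directly with a fold expression; a minimal sketch:

// C++17: expand the pack over the comma operator, one call per index.
template <std::size_t... Is>
void do_operation_impl(std::index_sequence<Is...>)
{
    ((static_cast<void>(Is), operation()), ...);
}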
but the compiler may already do that sort of optimization for a normal loop.
Related
My prof once said that if-statements are rather slow and should be avoided as much as possible. I'm making a game in OpenGL, where I need a lot of them.
In my tests replacing an if-statement with AND via short-circuiting worked, but is it faster?
#include <cstdlib>
#include <iostream>

bool doSomething();

int main()
{
    int randomNumber = std::rand() % 10;
    randomNumber == 5 && doSomething();
    return 0;
}

bool doSomething()
{
    std::cout << "function executed" << std::endl;
    return true;
}
My intention is to use this inside the draw function of my renderer. My models are supposed to have flags, if a flag is true, a certain function should execute.
if-statements are rather slow and should be avoided as much as possible.
This is wrong and/or misleading. Most simplified statements about slowness of a program are wrong. There's probably something wrong with this answer too.
C++ statements don't have a speed that can be attributed to them. It's the speed of the compiled program that matters. And that consists of assembly language instructions; not of C++ statements.
What would probably be more correct is to say that branch instructions can be relatively slow (on modern, superscalar CPU architectures) (when the branch cannot be predicted well) (depending on what you are comparing to; there are many things that are much more expensive).
randomNumber == 5 && doSomething();
An if-statement is often compiled into a program that uses a branch instruction. A short-circuiting logical-and operation is also often compiled into a program that uses a branch instruction. Replacing if-statement with a logical-and operator is not a magic bullet that makes the program faster.
If you were to compare the program produced by the logical-and and the corresponding program where it is replaced with if (randomNumber == 5), you would find that the optimiser sees through your trick and produces the same assembly in both cases.
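This is easy to check on an online compiler. A sketch (assuming doSomething is declared as in the question); both functions typically compile to identical assembly at -O2:

bool doSomething();

void withAnd(int n) { n == 5 && doSomething(); }
void withIf(int n)  { if (n == 5) doSomething(); }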
My models are supposed to have flags; if a flag is true, a certain function should execute.
In order to avoid the branch, you must change the premise. Instead of iterating through a sequence of all models, checking flag, and conditionally calling a function, you could create a sequence of all models for which the function should be called, iterate that, and call the function unconditionally -> no branching. Is this alternative faster? There is certainly some overhead of maintaining the data structure and the branch predictor may have made this unnecessary. Only way to know for sure is to measure the program.
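A minimal sketch of that idea (Model and drawSpecial are hypothetical stand-ins for your renderer's types):

#include <vector>

struct Model { /* ... */ };
void drawSpecial(Model&);          // hypothetical flagged-model function

// Maintained elsewhere: a model is added when its flag becomes true
// and removed when it becomes false.
std::vector<Model*> flaggedModels;

void drawFlagged()
{
    for (Model* m : flaggedModels) {
        drawSpecial(*m);           // unconditional call: no per-model branch
    }
}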
I agree with the comments above that in almost all practical cases, it's OK to use ifs as much as you need without hesitation.
I also agree that it is not an issue important enough for a beginner to waste energy optimizing, and that using logical operators will likely emit code similar to ifs.
However - there is a valid issue here related to branching in general, so those who are interested are welcome to read on.
Modern CPUs use what we call Instruction pipelining.
Without getting too deep into the technical details:
Within each CPU core there is a level of parallelism.
Each assembly instruction is composed of several stages, and while the current instruction is executed, the next instructions are prepared to a certain degree.
This is called instruction pipelining.
This concept is broken by any kind of branching in general, and by conditionals (ifs) in particular.
It's true that there is a mechanism of branch prediction, but it works only to some extent.
So although in most cases ifs are totally OK, there are cases it should be taken into account.
As always when it comes to optimizations, one should carefully profile.
Take the following piece of code as an example (similar things are common in image processing and other implementations):
unsigned char * pData = ...; // get data from somewhere
int dataSize = 100000000;    // something big
bool cond = ...;             // initialize some condition relevant for all data
for (int i = 0; i < dataSize; ++i, ++pData)
{
    if (cond)
    {
        *pData = 2; // imagine some small calculation
    }
    else
    {
        *pData = 3; // imagine some other small calculation
    }
}
It might be better to do it like this (even though it contains duplication, which is evil from a software engineering point of view):
if (cond)
{
    for (int i = 0; i < dataSize; ++i, ++pData)
    {
        *pData = 2; // imagine some small calculation
    }
}
else
{
    for (int i = 0; i < dataSize; ++i, ++pData)
    {
        *pData = 3; // imagine some other small calculation
    }
}
We still have an if, but it causes a branch only once rather than on every iteration.
In certain [rare] cases (requires profiling as mentioned above) it will be more efficient to do even something like this:
for (int i = 0; i < dataSize; ++i, ++pData)
{
    *pData = (2 * cond + 3 * (!cond));
}
I know it's not common, but some years ago I encountered specific HW on which the cost of 2 multiplications and 1 addition with negation was less than the cost of branching (due to the instruction pipeline being flushed). This "trick" also supports using different condition values for different parts of the data.
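For what it's worth, on today's mainstream CPUs the usual way to get a branch-free body is a plain ternary, which compilers typically lower to a conditional-move instruction rather than a branch (a sketch):

*pData = cond ? 2 : 3;   // commonly becomes a cmov on x86, no branch

This keeps the code readable and leaves the choice of instruction to the compiler.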
Bottom line: ifs are usually OK, but it's good to be aware that sometimes there is a cost.
Consider the following code:
void func(int a, size_t n)
{
    const bool cond = (a == 2);
    if (cond) {
        for (size_t i = 0; i < n; i++) {
            // do something small 1
            // continue by doing something else.
        }
    } else {
        for (size_t i = 0; i < n; i++) {
            // do something small 2
            // continue by doing something else.
        }
    }
}
In this code the // continue by doing something else. (which might be a large part and for some reason cannot be separated into a function) is repeated exactly the same. To avoid this repetition one can write:
void func(int a, size_t n)
{
    const bool cond = (a == 2);
    for (size_t i = 0; i < n; i++) {
        if (cond) {
            // do something small 1
        } else {
            // do something small 2
        }
        // continue by doing something else.
    }
}
Now we have an if-statement inside a (let's say very large) for-loop. But the condition of the if-statement (cond) is const and will not change. Would the compiler somehow optimize the code (like change it to the initial implementation)? Any tips? Thanks.
Details do matter, and you included too little. Since you are asking about compiler optimizations, you need to know that the compiler optimizes according to the as-if rule. Loosely speaking, the compiler can perform any optimization as long as it does not change the observable behavior (there are a few exceptions). Both your functions have zero observable behavior, hence with optimizations turned on (gcc -O3) this is what the compiler does to them:
func(int, unsigned long):
ret
func2(int, unsigned long):
ret
It is futile to speculate what the compiler does to your code. Don't speculate, but look at the output. You can do that here: https://godbolt.org/z/oznWz6.
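If you give the bodies observable behavior, there is something left to inspect. A sketch you could paste into Compiler Explorer (small1 and small2 are assumed to be defined in another translation unit):

#include <cstddef>

void small1(std::size_t);
void small2(std::size_t);

void func(int a, std::size_t n)
{
    const bool cond = (a == 2);
    for (std::size_t i = 0; i < n; i++) {
        if (cond) small1(i); else small2(i);
    }
}

GCC calls the transformation you are asking about loop unswitching; it is controlled by -funswitch-loops, which is enabled as part of -O3, so at that level you may well see the condition hoisted and two separate loops emitted.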
PS: Some mantras that I should not forget to include:
Don't do premature optimization. Code should be written primarily to be read by humans. Only when you have profiled and have evidence that you can gain something by improving a function should you consider trading readability for performance.
Also do not forget that code you write is not instructions for your CPU. Your code is an abstract description of what the final program should do. The compiler knows very well how to rearrange the code to get most out of your CPU. Typically it is much better at this than a human could possibly be.
Let's say I have
void f(const bool condition) {
    if (condition) {
        f2();
    } else {
        f3();
    }
    f4();
    if (condition) {
        f5();
    } else {
        f6();
    }
}
Since condition never changes, the above can be simplified to the following:
void f(const bool condition) {
    if (condition) {
        f2();
        f4();
        f5();
    } else {
        f3();
        f4();
        f6();
    }
}
Note that f4() is duplicated in the second snippet, but the second snippet has fewer if branches. I tried to profile the two code snippets, but it seems to me that the performance is almost identical. Imagine that in real life the above snippets could have many more ifs on the same condition. So I'm wondering, for modern x86/64 processors:
Is there any performance gain from having 2 giant if statements instead of many small ones based on the same condition?
Will the const keyword help the compiler/processor generate better branch predictions?
First of all, your example is simple enough for any decent compiler to produce identical code for both cases.
To confuse it enough, you should do something more complex instead of simply calling f4, like so:
void f_seprate_ifs(const bool condition) {
    if (condition) {
        f2();
    } else {
        f3();
    }
    for (int i = 0; i < 100; i++) {
        f4();
    }
    if (condition) {
        f5();
    } else {
        f6();
    }
}
void f_duplicate_f4(const bool condition) {
    if (condition) {
        f2();
        for (int i = 0; i < 100; i++) {
            f4();
        }
        f5();
    } else {
        f3();
        for (int i = 0; i < 100; i++) {
            f4();
        }
        f6();
    }
}
But then it is not a matter of style but a clear trade-off between speed and space: you are duplicating code to eliminate a branch (and IMO it is not a good trade-off for my example at all). The compiler already does this kind of thing all the time with function inlining, and has very complex heuristics on when to inline. And for your example, it even did it for you.
To summarize, do not try to do such micro optimizations unless you are absolutely sure they are necessary. Especially when they hurt readability. Especially especially when they attract copy-paste errors.
As for the const modifier: again, any decent compiler will notice that condition never changes and is effectively final, speaking in Java terms. In C++, const very rarely provides additional optimization opportunities. It is there for the programmer, not for the compiler.
For example, for:
void f(const bool& condition){
the condition is NOT constant: the compiler must assume that it can be changed by f4, so the snippets are no longer semantically equivalent.
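To make that concrete, a sketch of how the aliasing can bite (g_flag and the bodies are made up for illustration):

bool g_flag = true;

void f4() { g_flag = false; }      // can flip the flag mid-loop

void f(const bool& condition)
{
    for (int i = 0; i < 10; ++i) {
        if (condition)             // must be re-tested: it may alias g_flag
            f4();
    }
}

// f(g_flag);  // here f4() really does change 'condition' between iterations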
First of all, to notice any difference, you need to run your snippets multiple times, like:
for (int i = 0; i < 100000000; ++i)
    f(true);
You need to select the number of iterations to make the overall running time 10-30 seconds. In this case you will see the performance of the function itself and not various overheads like loading your application.
Second, what is the complexity of your functions f2 ... f6? If these functions are way more complex than f itself, again you will not notice any difference.
Third, your second version will be slightly faster, although the difference will be tiny. Adding const will not help the compiler in any way.
Finally, I would recommend looking at changes that will give a noticeable performance gain.
In theory, eliminating any conditional operation improves performance. But in the real world there can be no difference at all.
In your particular case the compiler can easily do the proposed optimization for you, so there should be no difference, as you have already tested. Branch elimination is one of the valuable jobs of optimizing compilers: they look for possibilities to avoid unnecessary branching.
So, the answer to your question 1:
In most cases on modern compilers there will be no difference.
Regarding const keyword:
const itself does not help in branch prediction. Compilers can see whether a variable is modified and apply whatever they can to generate fast code. When the binary code is executed by a processor, there is no indication that the value is constant, at least on x86 and x86-64 processors.
In any case, "premature optimization is the root of all evil" (c) Donald Knuth. You should avoid any low-level optimization unless you have profiling data that shows the bottleneck, and for that you need a benchmark to analyze the performance.
Possible Duplicate:
Can a recursive function be inline?
What are the trade-offs of making recursive functions inline?
Recursive functions that can be optimised by tail-end recursion can certainly be inlined. If the last thing a function does is call itself, then it can be converted into a plain loop.
Arbitrary recursive functions can't be inlined for the same reason a snake can't swallow its own tail.
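For example, a tail-recursive sum and the loop a compiler can turn it into (a sketch):

// Tail-recursive: the recursive call is the very last thing that happens.
int sum_to(int n, int acc) {
    if (n == 0) return acc;
    return sum_to(n - 1, acc + n);
}

// After tail-call elimination it is effectively this plain loop, which
// can then be inlined like any other function:
int sum_to_loop(int n, int acc) {
    while (n != 0) { acc += n; --n; }
    return acc;
}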
[Edit: just noticed that although your title says "be inlined", your actual question says "making functions inline". The two effectively have nothing to do with one another, they just have confusingly similar names. In modern compilers, the primary effect of inline is the thing that originally in C99 was (I think) just a necessary detail to make inline work at all: to permit multiple definitions of a symbol with external linkage. That's because modern compilers don't pay a whole lot of attention to the programmer's opinion of whether a function should be inlined. They do pay some, though, so the confusion of concepts persists. I've answered the question in the title, which is the decision the compiler makes, not the question in the body, which is the decision the programmer makes.]
Inlining is not necessarily an all-or-nothing deal. One strategy which compilers use to decide whether to inline, is to keep inlining function calls until the resulting code is "too big". "Big" is defined by some hopefully sensible heuristic.
So consider the following recursive function (which deliberately is not simply tail-recursive):
int triangle(int n) {
    if (n == 1) return 1;
    return n + triangle(n-1);
}
If it's called like this:
int t100() {
    return triangle(100);
}
Then there's no particular reason in principle that the usual rules that the compiler uses for inlining shouldn't result in this:
int t100() {
    // inline call to triangle(100)
    int result;
    if (100 == 1) { result = 1; } else {
        // inline call to triangle(99)
        int t99;
        if (100 - 1 == 1) { t99 = 1; } else {
            // inline call to triangle(98)
            int t98;
            if (100 - 1 - 1 == 1) { t98 = 1; } else {
                // oops, "too big", no more inlining
                t98 = triangle(100 - 1 - 1 - 1) + 98;
            }
            t99 = t98 + 99;
        }
        result = t99 + 100;
    }
    return result;
}
Obviously the optimiser will have a field day with that, so it's much "smaller" than it looks:
int t100() {
    return triangle(97) + 297;
}
The code in triangle itself could be "unrolled" a few steps by a few levels of inlining, in exactly the same way, except that it doesn't have the benefits of constants:
int triangle(int n) {
    if (n == 1) return 1;
    if (n == 2) return 3;
    if (n == 3) return 6;
    return triangle(n-3) + 3*n - 3;
}
I doubt whether compilers actually do this, though; I don't think I've ever noticed it [Edit: MSVC does if you tell it to, thanks peterchen].
There's an obvious potential benefit in saving call overhead, but as against that people don't really expect recursive functions to get inlined, and there's no particular guarantee that the usual inlining heuristics will perform well with recursive functions (where there are two different places, the call site and the recursive call, that might be inlined, with different benefits in each case). Furthermore, it's difficult at compile time to estimate how deep the recursion will go, and the inline heuristics might like to take account of the call depth to make decisions. So it may be that the compiler just doesn't bother.
Functional language compilers are typically a lot more aggressive dealing with recursion than C or C++ compilers. The relevant trade-off there is that so many functions written in functional languages are recursive, that performance might be hopeless if the compiler couldn't optimise tail-recursion. So Lisp programmers typically rely on good optimisation of recursive functions, whereas C and C++ programmers typically don't.
If your compiler does not support it, you can try manually inlining instead...
int factorial(int n) {
    int result = 1;
    if (n-- == 0) {
        return result;
    } else {
        result *= 1;
        if (n-- == 0) {
            return result;
        } else {
            result *= 2;
            if (n-- == 0) {
                return result;
            } else {
                result *= 3;
                if (n-- == 0) {
                    return result;
                } else {
                    result *= 4;
                    if (n-- == 0) {
                        return result;
                    } else {
                        // ...
                    }
                }
            }
        }
    }
}
See the problem yet?
Tail recursion (a special case of recursion) can be inlined by smart compilers.
Now, hold on. A tail-recursive function could be unrolled and inlined pretty easily. Apparently there are compilers that do this, but I am not aware of specifics.
Of course. Any function can be inlined if it makes sense to do it:
int f(int i)
{
    if (i <= 0) return 1;
    else return i * f(i - 1);
}

int main()
{
    return f(10);
}
pseudo assembly (f is inlined in main):
main:
    mov r0, #10     ; pass 10 to f
f:
    cmp r0, #0      ; arg <= 0? ...
    bgt 1f
    mov r0, #1      ; ... if so, return 1
    ret
1:
    mov r0, -(sp)   ; if not, save arg.
    dec r0          ; pass arg - 1 to f
    call f          ; just because it's inlined doesn't mean I can't call it.
    mul r0, (sp)+   ; compute the result
    ret             ; done.
;-)
When you call an ordinary function, you change the sequential execution order and jump (call or jmp) to the address where the function resides. Inlining means that the function's instructions are placed at every occurrence of the call, so there is no single place to jump to; other kinds of optimizations can also be applied, like eliminating the pushing and popping of function parameters.
When you know that the recursive chain will normally not be very long, you could inline up to a predefined level (I don't know if any existing compiler is intelligent enough for this today).
Inlining a recursive function is much like unrolling a loop. You will end up with much duplicate code -- but in some cases it could be worthwhile:
The number of recursive calls (the length of the chain) is normally short (in case it gets longer than the predefined depth, just fall back to normal recursion).
The overhead of the function calls is relatively big compared to the logic, so doing some "unrolling", for example five instances before making a recursive call again, would save about 80% of the call overhead.
Of course there is the tail-recursive special case, but this was mentioned by others.
Of course it can be declared inline. The inline keyword is just a hint to the compiler. In many cases the compiler simply ignores it, and depending on the compiler the result could be one of these situations:
Some compilers can turn tail recursion into plain loops, and thus inline them normally.
Non-tail recursion could be inlined up to a given depth, usually decided by the compiler.
I've never encountered a practical application for that, as the cost of a call isn't high enough anymore to offset the increase in code size.
[edit] (To clarify: even though I like to toy with these things, and often check what code my compiler generates for "funny stuff" just out of curiosity, I haven't encountered a use case where any such unrolling helped significantly. This doesn't mean they don't exist or couldn't be constructed.)
The only place where it would help is precalculating low iterations during compile time. However, in my experience this immensely increases compile times for often negligible runtime performance benefits.
Note that Visual Studio 2008 (and earlier) gives you quite some control over this:
#pragma inline_recursion(on)
#pragma inline_depth(N)
__forceinline
Be careful with the latter, it can easily overload the compiler :)
Inline means that at each place a call to a function marked inline is made, the compiler places a copy of that function's code there. This avoids the function-calling mechanism and its usual argument stack pushing and popping, saving time in gazillion-calls-per-second situations. You see the consequences for static variables and stuff like that? All gone...
So, if you had an inlined recursive call, either your compiler is super smart and figures out whether the number of copies is deterministic, or it will say "cannot make it inline", because it wouldn't know when to stop.
Consider an example like this:
if (flag)
    for (condition)
        do_something();
else
    for (condition)
        do_something_else();
If flag doesn't change inside the for loops, this should be semantically equivalent to:
for (condition)
    if (flag)
        do_something();
    else
        do_something_else();
The difference is that in the first case, the code might be much longer (e.g. if several for loops are used, or if do_something() is a code block that is mostly identical to do_something_else()), while in the second case, the flag gets checked many times.
I'm curious whether current C++ compilers (most importantly, g++) would be able to optimize the second example to get rid of the repeated tests inside the for loop. If so, under what conditions is this possible?
Yes, if it is determined that flag doesn't change and can't be changed by do_something or do_something_else, it can be pulled outside the loop. I've heard this called loop hoisting, but Wikipedia has an entry called "loop invariant code motion".
If flag is a local variable, the compiler should be able to do this optimization, since it's guaranteed to have no effect on the behavior of the generated code.
If flag is a global variable and you call functions inside your loop, it might not perform the optimization; it may not be able to determine whether those functions modify the global.
This can also be affected by the sort of optimization you do - optimizing for size would favor the non-hoisted version while optimizing for speed would probably favor the hoisted version.
In general, this isn't the sort of thing that you should worry about, unless profiling tells you that the function is a hotspot and you see that less than efficient code is actually being generated by going over the assembly the compiler outputs. Micro-optimizations like this you should always just leave to the compiler unless you absolutely have to.
Tried with GCC and -O3:
void foo();
void bar();

int main()
{
    bool doesnt_change = true;
    for (int i = 0; i != 3; ++i) {
        if (doesnt_change) {
            foo();
        }
        else {
            bar();
        }
    }
}
Result for main:
_main:
    pushl %ebp
    movl %esp, %ebp
    andl $-16, %esp
    call ___main
    call __Z3foov
    call __Z3foov
    call __Z3foov
    xorl %eax, %eax
    leave
    ret
So it does optimize away the choice (and unrolls smaller loops).
This optimization is not done if doesnt_change is global.
I'm sure that if the compiler can determine that the flag will remain constant, it can do some shuffling:
const bool flag = /* ... */;
for (..;..;..)
{
    if (flag)
    {
        // ...
    }
    else
    {
        // ...
    }
}
If the flag is not const, the compiler cannot necessarily optimize the loop, because it can't be sure flag won't change. It can if it does static analysis, but not all compilers do, I think. const is the sure-fire way of telling the compiler the flag won't change, after that it's up to the compiler.
As usual, profile and find out if it's really a problem.
I would be wary of saying that it will. Can it guarantee that the value won't be modified by this, or another, thread?
That said, the second version of the code is generally more readable and it would probably be the last thing to optimize in a block of code.
As many have said: it depends.
If you want to be sure, you should try to force a compile-time decision. Templates often come in handy for this:
for (condition)
    do_it<flag>();
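Fleshed out slightly (a sketch; do_something and do_something_else stand in for the real loop bodies, and C++17's if constexpr is used for the compile-time choice):

void do_something();
void do_something_else();

template <bool Flag>
void do_it()
{
    // if constexpr: the untaken branch is discarded at compile time.
    if constexpr (Flag) do_something();
    else                do_something_else();
}

template <bool Flag>
void loop(int n)
{
    for (int i = 0; i < n; ++i)
        do_it<Flag>();                    // no runtime test inside the loop
}

void dispatch(bool flag, int n)
{
    flag ? loop<true>(n) : loop<false>(n);  // the single runtime branch
}

The cost is that both versions of the loop get instantiated, and the caller needs one runtime branch to pick between them.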
Generally, yes. But there is no guarantee, and the places where the compiler will do it are probably rare.
What most compilers do without a problem is hoisting immutable evaluations out of the loop, e.g. if your condition is
if (a<b) ....
when a and b are not affected by the loop, the comparison will be made once before the loop.
This means that if the compiler can determine the condition does not change, the test is cheap and the jump well predicted. This in turn means the test itself costs one cycle or no cycle at all (really).
In which cases splitting the loop would be beneficial?
a) a very tight loop where the 1 cycle is a significant cost
b) the entire loop with both parts does not fit the code cache
Now, the compiler can only make assumptions about the code cache, and usually can order the code in a way that one branch will fit the cache.
Without any testing, I'd expect a) to be the only case where such an optimization would be applied, because it's not always the better choice:
In which cases splitting the loop would be bad?
When splitting the loop increases code size beyond the code cache, you will take a significant hit. Now, that only affects you if the loop itself is called within another loop, but that's something the compiler usually can't determine.
[edit]
I couldn't get VC9 to split the following loop (one of the few cases where it might actually be beneficial)
extern volatile int vflag = 0;

int foo(int count)
{
    int sum = 0;
    int flag = vflag;
    for (int i = 0; i < count; ++i)
    {
        if (flag)
            sum += i;
        else
            sum -= i;
    }
    return sum;
}
[edit 2]
Note that with int flag = true; the second branch does get optimized away (and no, const doesn't make a difference here ;)).
What does that mean? Either it doesn't support that, or it doesn't matter, or my analysis is wrong ;-)
Generally, I'd assume this is an optimization that is valuable only in very few cases, and one that can be done by hand easily in most scenarios.
It's called a loop invariant and the optimization is called loop invariant code motion and also code hoisting. The fact that it's in a conditional will definitely make the code analysis more complex and the compiler may or may not invert the loop and the conditional depending on how clever the optimizer is.
There is a general answer for any specific case of this kind of question, and that's to compile your program and look at the generated code.