Can C++ compilers optimize "if" statements inside "for" loops?

Can C++ compilers optimize "if" statements inside "for" loops? - c++

Consider an example like this:
if (flag)
for (condition)
do_something();
else
for (condition)
do_something_else();
If flag doesn't change inside the for loops, this should be semantically equivalent to:
for (condition)
if (flag)
do_something();
else
do_something_else();
Only in the first case, the code might be much longer (e.g. if several for loops are used or if do_something() is a code block that is mostly identical to do_something_else()), while in the second case, the flag gets checked many times.
I'm curious whether current C++ compilers (most importantly, g++) would be able to optimize the second example to get rid of the repeated tests inside the for loop. If so, under what conditions is this possible?

Yes, if it is determined that flag doesn't change and can't be changed by do_something or do_something_else, it can be pulled outside the loop. I've heard of this called loop hoisting, but Wikipedia has an entry called "loop invariant code motion".
If flags is a local variable, the compiler should be able to do this optimization since it's guaranteed to have no effect on the behavior of the generated code.
If flags is a global variable, and you call functions inside your loop it might not perform the optimization - it may not be able to determine if those functions modify the global.
This can also be affected by the sort of optimization you do - optimizing for size would favor the non-hoisted version while optimizing for speed would probably favor the hoisted version.
In general, this isn't the sort of thing that you should worry about, unless profiling tells you that the function is a hotspot and you see that less than efficient code is actually being generated by going over the assembly the compiler outputs. Micro-optimizations like this you should always just leave to the compiler unless you absolutely have to.

Tried with GCC and -O3:
void foo();
void bar();
int main()
{
bool doesnt_change = true;
for (int i = 0; i != 3; ++i) {
if (doesnt_change) {
foo();
}
else {
bar();
}
}
}
Result for main:
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
call ___main
call __Z3foov
call __Z3foov
call __Z3foov
xorl %eax, %eax
leave
ret
So it does optimize away the choice (and unrolls smaller loops).
This optimization is not done if doesnt_change is global.

I'm sure if the compiler can determine that the flag will remain constant, it can do some shufflling:
const bool flag = /* ... */;
for (..;..;..;)
{
if (flag)
{
// ...
}
else
{
// ...
}
}
If the flag is not const, the compiler cannot necessarily optimize the loop, because it can't be sure flag won't change. It can if it does static analysis, but not all compilers do, I think. const is the sure-fire way of telling the compiler the flag won't change, after that it's up to the compiler.
As usual, profile and find out if it's really a problem.

I would be wary to say that it will. Can it guarantee that the value won't be modified by this, or another thread?
That said, the second version of the code is generally more readable and it would probably be the last thing to optimize in a block of code.

As many have said: it depends.
If you want to be sure, you should try to force a compile-time decision. Templates often come in handy for this:
for (condition)
do_it<flag>();

Generally, yes. But there is no guarantee, and the places where the compiler will do it are probably rare.
What most compilers do without a problem is hoisting immutable evaluations out of the loop, e.g. if your condition is
if (a<b) ....
when a and b are not affected by the loop, the comparison will be made once before the loop.
This means if the compiler can determine the condition does not change, the test is cheap and the jump wenll predicted. This in turn means the test itself costs one cycle or no cycle at all (really).
In which cases splitting the loop would be beneficial?
a) a very tight loop where the 1 cycle is a significant cost
b) the entire loop with both parts does not fit the code cache
Now, the compiler can only make assumptions about the code cache, and usually can order the code in a way that one branch will fit the cache.
Without any testing, I'dexpect a) the only case where such an optimization would be applied, becasue it's nto always the better choice:
In which cases splitting the loop would be bad?
When splitting the loop increases code size beyond the code cache, you will take a significant hit. Now, that only affects you if the loop itself is called within another loop, but that's something the compiler usually can't determine.
[edit]
I couldn't get VC9 to split the following loop (one of the few cases where it might actually be beneficial)
extern volatile int vflag = 0;
int foo(int count)
{
int sum = 0;
int flag = vflag;
for(int i=0; i<count; ++i)
{
if (flag)
sum += i;
else
sum -= i;
}
return sum;
}
[edit 2]
note that with int flag = true; the second branch does get optimized away. (and no, const doesn't make a difference here ;))
What does that mean? Either it doesn't support that, it doesn't matter, ro my analysis is wrong ;-)
Generally, I'd asume this is an optimization that is valuable only in a very few cases, and can be done by hand easily in most scenarios.

It's called a loop invariant and the optimization is called loop invariant code motion and also code hoisting. The fact that it's in a conditional will definitely make the code analysis more complex and the compiler may or may not invert the loop and the conditional depending on how clever the optimizer is.
There is a general answer for any specific case of this kind of question, and that's to compile your program and look at the generated code.

Related

Optimization barrier for microbenchmarks in MSVC: tell the optimizer you clobber memory?

Chandler Carruth introduced two functions in his CppCon2015 talk that can be used to do some fine-grained inhibition of the optimizer. They are useful to write micro-benchmarks that the optimizer won't simply nuke into meaninglessness.
void clobber() {
asm volatile("" : : : "memory");
}
void escape(void* p) {
asm volatile("" : : "g"(p) : "memory");
}
These use inline assembly statements to change the assumptions of the optimizer.
The assembly statement in clobber states that the assembly code in it can read and write anywhere in memory. The actual assembly code is empty, but the optimizer won't look into it because it's asm volatile. It believes it when we tell it the code might read and write everywhere in memory. This effectively prevents the optimizer from reordering or discarding memory writes prior to the call to clobber, and forces memory reads after the call to clobber†.
The one in escape, additionally makes the pointer p visible to the assembly block. Again, because the optimizer won't look into the actual inline assembly code that code can be empty, and the optimizer will still assume that the block uses the address pointed by the pointer p. This effectively forces whatever p points to be in memory and not not in a register, because the assembly block might perform a read from that address.
(This is important because the clobber function won't force reads nor writes for anything that the compilers decides to put in a register, since the assembly statement in clobber doesn't state that anything in particular must be visible to the assembly.)
All of this happens without any additional code being generated directly by these "barriers". They are purely compile-time artifacts.
These use language extensions supported in GCC and in Clang, though. Is there a way to have similar behaviour when using MSVC?
† To understand why the optimizer has to think this way, imagine if the assembly block were a loop adding 1 to every byte in memory.

Given your approximation of escape(), you should also be fine with the following approximation of clobber() (note that this is a draft idea, deferring some of the solution to the implementation of the function nextLocationToClobber()):
// always returns false, but in an undeducible way
bool isClobberingEnabled();
// The challenge is to implement this function in a way,
// that will make even the smartest optimizer believe that
// it can deliver a valid pointer pointing anywhere in the heap,
// stack or the static memory.
volatile char* nextLocationToClobber();
const bool clobberingIsEnabled = isClobberingEnabled();
volatile char* clobberingPtr;
inline void clobber() {
if ( clobberingIsEnabled ) {
// This will never be executed, but the compiler
// cannot know about it.
clobberingPtr = nextLocationToClobber();
*clobberingPtr = *clobberingPtr;
}
}
UPDATE
Question: How would you ensure that isClobberingEnabled returns false "in an undeducible way"? Certainly it would be trivial to place the definition in another translation unit, but the minute you enable LTCG, that strategy is defeated. What did you have in mind?
Answer: We can take advantage of a hard-to-prove property from the number theory, for example, Fermat's Last Theorem:
bool undeducible_false() {
// It took mathematicians more than 3 centuries to prove Fermat's
// last theorem in its most general form. Hardly that knowledge
// has been put into compilers (or the compiler will try hard
// enough to check all one million possible combinations below).
// Caveat: avoid integer overflow (Fermat's theorem
// doesn't hold for modulo arithmetic)
std::uint32_t a = std::clock() % 100 + 1;
std::uint32_t b = std::rand() % 100 + 1;
std::uint32_t c = reinterpret_cast<std::uintptr_t>(&a) % 100 + 1;
return a*a*a + b*b*b == c*c*c;
}

I have used the following in place of escape.
#ifdef _MSC_VER
#pragma optimize("", off)
template <typename T>
inline void escape(T* p) {
*reinterpret_cast<char volatile*>(p) =
*reinterpret_cast<char const volatile*>(p); // thanks, #milleniumbug
}
#pragma optimize("", on)
#endif
It's not perfect but it's close enough, I think.
Sadly, I don't have a way to emulate clobber.

Why doesn't my C++ compiler optimize these memory writes away?

I created this program. It does nothing of interest but use processing power.
Looking at the output with objdump -d, I can see the three rand calls and corresponding mov instructions near the end even when compiling with O3 .
Why doesn't the compiler realize that memory isn't going to be used and just replace the bottom half with while(1){}? I'm using gcc, but I'm mostly interested in what is required by the standard.
/*
* Create a program that does nothing except slow down the computer.
*/
#include <cstdlib>
#include <unistd.h>
int getRand(int max) {
return rand() % max;
}
int main() {
for (int thread = 0; thread < 5; thread++) {
fork();
}
int len = 1000;
int *garbage = (int*)malloc(sizeof(int)*len);
for (int x = 0; x < len; x++) {
garbage[x] = x;
}
while (true) {
garbage[getRand(len)] = garbage[getRand(len)] - garbage[getRand(len)];
}
}

Because GCC isn't smart enough to perform this optimization on dynamically allocated memory. However, if you change garbageto be a local array instead, GCC compiles the loop to this:
.L4:
call rand
call rand
call rand
jmp .L4
This just calls rand repeatedly (which is needed because the call has side effects), but optimizes out the reads and writes.
If GCC was even smarter, it could also optimize out the randcalls, because its side effects only affect any later randcalls, and in this case there aren't any. However, this sort of optimization would probably be a waste of compiler writers' time.

It can't, in general, tell that rand() doesn't have observable side-effects here, and it isn't required to remove those calls.
It could remove the writes, but it may be the use of arrays is enough to suppress that.
The standard neither requires nor prohibits what it is doing. As long as the program has the correct observable behaviour any optimisation is purely a quality of implementation matter.

This code causes undefined behaviour because it has an infinite loop with no observable behaviour. Therefore any result is permissible.
In C++14 the text is 1.10/27:
The implementation may assume that any thread will eventually do one of the following:
terminate,
make a call to a library I/O function,
access or modify a volatile object, or
perform a synchronization operation or an atomic operation.
[Note: This is intended to allow compiler transformations such as removal of empty loops, even when termination cannot be proven. —end note ]
I wouldn't say that rand() counts as an I/O function.
Related question

Leave it a chance to crash by array overflow ! The compiler won't speculate on the range of outputs of getRand.

Prevent C++11 removal of endless loops

As discussed in this question, C++11 optimizes endless loops away.
However, in embedded devices which have a single purpose, endless loops make sense and are actually quite often used. Even a completely empty while(1); is useful for a watchdog-assisted reset. Terminating but empty loops can also be useful in embedded development.
Is there an elegant way to specifically tell the compiler to not remove empty or endless loops, without disabling optimization altogether?

One of the requirements for a loop to be removed (as mentioned in that question) is that it
does not access or modify volatile objects
So,
void wait_forever(void)
{
volatile int i = 1;
while (i) ;
}
should do the trick, although I would certainly verify this by looking at the disassembly of a program produced with your particular toolchain.
A function like this would be a good candidate for GCC's noreturn attribute as well.
void wait_forever(void) __attribute__ ((noreturn));
void wait_forever(void)
{
volatile int i = 1;
while (i) ;
}
int main(void)
{
if (something_bad_happened)
wait_forever();
}

Optimization and multithreading in B.Stroustrup's new book

Please refer to section 41.2.2 Instruction Reordering of "TCPL" 4th edition by B.Stroustrup, which I transcribe below:
To gain performance, compilers, optimizers, and hardware reorder
instructions. Consider:
// thread 1:
int x;
bool x_init;
void init()
{
x = initialize(); // no use of x_init in initialize()
x_init = true;
// ...
}
For this piece of code there is no stated reason to assign to x before
assigning to x_init. The optimizer (or the hardware instruction
scheduler) may decide to speed up the program by executing x_init =
true first. We probably meant for x_init to indicate whether x had
been initialized by initializer() or not. However, we did not say
that, so the hardware, the compiler, and the optimizer do not know
that.
Add another thread to the program:
// thread 2:
extern int x;
extern bool x_init;
void f2()
{
int y;
while (!x_init) // if necessary, wait for initialization to complete
this_thread::sleep_for(milliseconds{10});
y = x;
// ...
}
Now we have a problem: thread 2 may never wait and thus will assign an
uninitialized x to y. Even if thread 1 did not set x_init and x in
‘‘the wrong order,’’ we still may have a problem. In thread 2, there
are no assignments to x_init, so an optimizer may decide to lift the
evaluation of !x_init out of the loop, so that thread 2 either never
sleeps or sleeps forever.
Does the Standard allow the reordering in thread 1? (some quote from the Standard would be forthcoming) Why would that speed up the program?
Both answers in this discussion on SO seem to indicate that no such optimization occurs when there are global variables in the code, as x_init above.
What does the author mean by "to lift the evaluation of !x_init out of the loop"? Is this something like this?
if( !x_init ) while(true) this_thread::sleep_for(milliseconds{10});
y = x;

This is not so much a issue of the C++ compiler/standard, but that of modern CPUs. Have a look here. The compiler isn't going to emit memory barrier instructions between the assignments of x and x_init unless you tell it to.
For what it is worth, prior to C++11, the standard had no notion of multi threading in it's abstract machine model. Things are a bit nicer these days.

The C++11 standard does not "allow" or "prevent" reordering. It specifies some way to force a specific "barrier" that, it turns, prevent the compiler to move instructions before/after them. A compiler, in this example might reorder the assignment because it might be more efficient on a CPU with multiple computing unit (ALU/Hyperthreading/etc...) even with a single core. Typically, if your CPU has 2 ALU that can work in parallel, there is no reason the compiler would not try to feed them with as much work as it can.
I'm not speaking of out-of-order reordering of the CPU instructions that's done internally in Intel CPU (for example), but compile time ordering to ensure all the computing resources are busy doing some work.
I think it depends on the compiler compilation flags. Typically unless you tell it to, the compiler must assume that another compilation unit (say B.cpp, which is not visible at compile time) can have a "extern bool x_init", and can change it at any time. Then, the re-ordering optimization would break with the expected behavior (B can define the initialize() function). This example is trivial and unlikely to break. The linked SO answer are not related to this "optimization", but simply that, in their case, the compiler can not make the assumption that the global array is not modified externally, and as such can not make the optimization. This is not like your example.
Yes. It's a very common optimization trick, instead of:
// test is a bool
for (int i = 0; i < 345; i++) {
if (test) do_something();
}
The compiler might do:
if (test) for(int i = 0; i < 345; i++) { do_something(); }
And save 344 useless tests.

C or C++ : for loop variable

My question is a very basic one. In C or C++:
Let's say the for loop is as follows,
for(int i=0; i<someArray[a+b]; i++) {
....
do operations;
}
My question is whether the calculation a+b, is performed for each for loop or it is computed only once at the beginning of the loop?
For my requirements, the value a+b is constant. If a+b is computed and the value someArray[a+b]is accessed each time in the loop, I would use a temporary variable for someArray[a+b]to get better performance.

You can find out, when you look at the generated code
g++ -S file.cpp
and
g++ -O2 -S file.cpp
Look at the output file.s and compare the two versions. If someArray[a+b] can be reduced to a constant value for all loop cycles, the optimizer will usually do so and pull it out into a temporary variable or register.

It will behave as if it was computed each time. If the compiler is optimising and is capable of proving that the result does not change, it is allowed to move the computation out of the loop. Otherwise, it will be recomputed each time.
If you're certain the result is constant, and speed is important, use a variable to cache it.

is performed for each for loop or it is computed only once at the beginning of the loop?
If the compiler is not optimizing this code then it will be computed each time. Safer is to use a temporary variable it should not cost too much.

First, the C and C++ standards do not specify how an implementation must evaluate i<someArray[a+b], just that the result must be as if it were performed each iteration (provided the program conforms to other language requirements).
Second, any C and C++ implementation of modest quality will have the goal of avoiding repeated evaluation of expressions whose value does not change, unless optimization is disabled.
Third, several things can interfere with that goal, including:
If a, b, or someArray are declared with scope visible outside the function (e.g., are declared at file scope) and the code in the loop calls other functions, the C or C++ implementation may be unable to determine whether a, b, or someArray are altered during the loop.
If the address of a, b, or someArray or its elements is taken, the C or C++ implementation may be unable to determine whether that address is used to alter those objects. This includes the possibility that someArray is an array passed into the function, so its address is known to other entities outside the function.
If a, b, or the elements of someArray are volatile, the C or C++ implementation must assume they can be changed at any time.
Consider this code:
void foo(int *someArray, int *otherArray)
{
int a = 3, b = 4;
for(int i = 0; i < someArray[a+b]; i++)
{
… various operations …
otherArray[i] = something;
}
}
In this code, the C or C++ implementation generally cannot know whether otherArray points to the same array (or an overlapping part) as someArray. Therefore, it must assume that otherArray[i] = something; may change someArray[a+b].
Note that I have answered regarding the larger expression someArray[a+b] rather than just the part you asked about, a+b. If you are only concerned about a+b, then only the factors that affect a and b are relevant, obviously.

Depends on how good the compiler is, what optimization levels you use and how a and b are declared.
For example, if a and/or b has volatile qualifier then compiler has to read it/them everytime. In that case, compiler can't choose to optimize it with the value of a+b. Otherwise, look at the code generated by the compiler to understand what your compiler does.
There's no standard behaviour on how this is calculated in neither C not C++.

I will bet that if a and b do not change over the loop it is optimized. Moreover, if someArray[a+b] is not touched it is also optimized. This is actually more important since since fetching operations are quite expensive.
That is with any half-decent compiler with most basic optimizations. I will also go as far as saying that people who say it does always evaluate are plain wrong. It is not always for certain, and it is most probably optimized whenever possible.

The calculation is performed each for loop. Although the optimizer can be smart and optimize it out, you would be better off with something like this:
// C++ lets you create a const reference; you cannot do it in C, though
const some_array_type &last(someArray[a+b]);
for(int i=0; i<last; i++) {
...
}

It calculates every time. Use a variable :)

You can compile it and check the assembly code just to make sure.
But I think most compilers are clever enough to optimize this kind of stuff. (If you are using some optimization flag)

It might be calculated every time or it might be optimised. It will depends on whether a and b exist in a scope that the compiler can guarantee that no external function can change their values. That is, if they are in a global context, the compiler cannot guarantee that a function you call in the loop will modify them (unless you don't call any functions). If they are only in local context, then the compiler can attempt to optimise that calculation away.
Generating both optimised and unoptimised assembly code is the easiest way to check. However, the best thing to do is not care because the cost of that sum is so incredibly cheap. Modern processors are very very fast and the thing that is slow is pulling in data from RAM to the cache. If you want to optimised your code, profile it; don't guess.

The calculation a+b would be carried out every iteration of the loop, and then the lookup into someArray is carried out every iteration of the loop, so you could probably save a lot of processor time by having a temporary variable set outside the loop, for example(if the array is an array of ints say):
int myLoopVar = someArray[a+b]
for(int i=0; i<myLoopVar; i++)
{
....
do operations;
}
Very simplified explanation:
If the value at array position a+b were a mere 5 for example, that would be 5 calculations and 5 lookups, so 10 operations, which would be replaced by 8 by using a variable outside the loop (5 accesses (1 per iteration of the loop), 1 calculation of a+b, 1 lookup and 1 assignment to the new variable) not so great a saving. If however you are dealing with larger values, for example the value stored in the array at a+b id 100, you would potentially be doing 100 calculations and 100 lookups, versus 103 operations if you have a variable outside the loop (100 accesses(1 per iteration of the loop), 1 calculation of a+b, 1 lookup and 1 assignment to the new variable).
The majority of the above however is dependant on the compiler: depending upon which switches you utilise, what optimisations the compiler can apply automatically etc., the code may well be optimised without you having to do any changes to your code. Best thing to do is weigh up the pros and cons of each approach specifically for your current implementation, as what may suit a large number of iterations may not be most efficient for a small number, or perhaps memory may be an issue which would dictate a differing style to your program . . . Well you get the idea :)
Let me know if you need any more info:)

for the following code:
int a = 10, b = 10;
for(int i=0; i< (a+b); i++) {} // a and b do not change in the body of loop
you get the following assembly:
L3:
addl $1, 12(%esp) ;increment i
L2:
movl 4(%esp), %eax ;move a to reg AX
movl 8(%esp), %edx ;move b to reg BX
addl %edx, %eax ;AX = AX + BX, i.e. AX = a + b
cmpl 12(%esp), %eax ;compare AX with i
jg L3 ;if AX > i, jump to L3 label
if you apply the compiler optimization, you get the following assembly:
movl $20, %eax ;move 20 (a+b) to AX
L3:
subl $1, %eax ;decrement AX
jne L3 ;jump if condition met
movl $0, %eax ;move 0 to AX
basically, in this case, with my compiler (MinGW 4.8.0), the loop will do "the calculation" regardless of whether you're changing the conditional variables within the loop or not (haven't posted assembly for this, but take my word for it, or even better, don't and disassemble the code yourself).
when you apply the optimization, the compiler will do some magic and churn out a set of instructions that are completely unrecognizable.
if you dont feel like optimizing your loop through a compiler action (-On), then declaring one variable and assigning it a+b will reduce your assembly by an instruction or two.
int a = 10, b = 10;
const int c = a + b;
for(int i=0; i< c; i++) {}
assembly:
L3:
addl $1, 12(%esp)
L2:
movl 12(%esp), %eax
cmpl (%esp), %eax
jl L3
movl $0, %eax
keep in mind, the assembly code i posted here is only the relevant snippet, there's a bit more, but it's not relevant as far as the question goes

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Can C++ compilers optimize "if" statements inside "for" loops? - c++

I would be wary to say that it will. Can it guarantee that the value won't be modified by this, or another thread? That said, the second version of the code is generally more readable and it would probably be the last thing to optimize in a block of code.

As many have said: it depends. If you want to be sure, you should try to force a compile-time decision. Templates often come in handy for this: for (condition) do_it<flag>();

Related

Optimization barrier for microbenchmarks in MSVC: tell the optimizer you clobber memory?

Why doesn't my C++ compiler optimize these memory writes away?

Prevent C++11 removal of endless loops

Optimization and multithreading in B.Stroustrup's new book

C or C++ : for loop variable

Categories

Resources