Is it possible to have a loop which has a zero execution time? I would think that even an empty loop should have an execution time since there is an overhead associated with it.
Yes, under the as-if rule the compiler is only obligated to emulate the observable behavior of the code. So if you have a loop that does not have any observable behavior, it can be optimized away completely and will therefore effectively have zero execution time.
Examples
For example the following code:
int main()
{
    int j = 0 ;

    for( int i = 0; i < 10000; ++i )
    {
        ++j ;
    }
}
compiled with gcc 4.9 using the -O3 flag basically ends up reducing to the following (see it live):
main:
xorl %eax, %eax #
ret
Pretty much all allowed optimizations fall under the as-if rule; the only exception I am aware of is copy elision, which is allowed to affect the observable behavior.
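For example, here is a minimal sketch of copy elision affecting observable behavior (the Noisy type is made up for illustration): the copy constructor has a side effect, yet the compiler may elide the copy, so the message may never print.

#include <cstdio>

struct Noisy
{
    Noisy() = default;
    Noisy( const Noisy& ) { std::puts( "copied" ) ; } // side effect in the copy constructor
};

Noisy make()
{
    return Noisy() ; // the copy may be elided despite the side effect
}

int main()
{
    Noisy n = make() ; // "copied" may never be printed
}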
Some other examples would include dead code elimination, which can remove code that the compiler can prove will never be executed. For example, even though the following loop does indeed contain a side effect, it can be optimized out since we can prove it will never be executed (see it live):
#include <stdio.h>

int main()
{
    int j = 0 ;

    if( false ) // The loop will never execute
    {
        for( int i = 0; i < 10000; ++i )
        {
            printf( "%d\n", j ) ;
            ++j ;
        }
    }
}
The loop will be optimized away the same as in the previous example. A more interesting example is the case where a calculation in a loop can be deduced to a constant, thereby avoiding the need for a loop (not sure what optimization category this falls under), for example:
int j = 0 ;

for( int i = 0; i < 10000; ++i )
{
    ++j ;
}

printf( "%d\n", j ) ;
can be optimized away to (see it live):
movl $10000, %esi #,
movl $.LC0, %edi #,
xorl %eax, %eax #
call printf #
We can see there is no loop involved.
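By contrast, here is a minimal sketch of a loop that cannot be removed: making j volatile turns every access into observable behavior, so the compiler must actually perform all 10000 iterations.

int main()
{
    volatile int j = 0 ;

    for( int i = 0; i < 10000; ++i )
    {
        j = j + 1 ; // volatile accesses are observable and cannot be elided
    }
}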
Where is the as-if rule covered in the standard?
The as-if rule is covered in the draft C99 standard section 5.1.2.3 Program execution which says:
In the abstract machine, all expressions are evaluated as specified by
the semantics. An actual implementation need not evaluate part of an
expression if it can deduce that its value is not used and that no
needed side effects are produced (including any caused by calling a
function or accessing a volatile object).
The as-if rule also applies to C++, gcc will produce the same result in C++ mode as well. The C++ draft standard covers this in section 1.9 Program execution:
The semantic descriptions in this International Standard define a
parameterized nondeterministic abstract machine. This International
Standard places no requirement on the structure of conforming
implementations. In particular, they need not copy or emulate the
structure of the abstract machine. Rather, conforming implementations
are required to emulate (only) the observable behavior of the abstract
machine as explained below.5
Yes - if the compiler determines that the loop is dead code (will never execute) then it will not generate code for it. That loop will have 0 execution time, although strictly speaking it doesn't exist at the machine code level.
As well as compiler optimisations, some CPU architectures, particularly DSPs, have zero overhead looping, whereby a loop with a fixed number of iterations is effectively optimised away by the hardware, see e.g. http://www.dsprelated.com/showmessage/20681/1.php
The compiler is not obliged to evaluate the expression, or a portion of an expression, that has no side effects and whose result is discarded.
Harbison and Steele, C: A Reference Manual
Related
I created this program. It does nothing of interest but use processing power.
Looking at the output with objdump -d, I can see the three rand calls and corresponding mov instructions near the end even when compiling with -O3.
Why doesn't the compiler realize that memory isn't going to be used and just replace the bottom half with while(1){}? I'm using gcc, but I'm mostly interested in what is required by the standard.
/*
 * Create a program that does nothing except slow down the computer.
 */
#include <cstdlib>
#include <unistd.h>

int getRand(int max) {
    return rand() % max;
}

int main() {
    for (int thread = 0; thread < 5; thread++) {
        fork();
    }

    int len = 1000;
    int *garbage = (int*)malloc(sizeof(int)*len);
    for (int x = 0; x < len; x++) {
        garbage[x] = x;
    }

    while (true) {
        garbage[getRand(len)] = garbage[getRand(len)] - garbage[getRand(len)];
    }
}
Because GCC isn't smart enough to perform this optimization on dynamically allocated memory. However, if you change garbage to be a local array instead, GCC compiles the loop to this:
.L4:
call rand
call rand
call rand
jmp .L4
This just calls rand repeatedly (which is needed because the call has side effects), but optimizes out the reads and writes.
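For reference, here is a sketch of that local-array variant (the fork loop is dropped for brevity): with garbage on the stack and never read after the loop, the stores are dead and GCC can remove them.

#include <cstdlib>

int getRand(int max) {
    return rand() % max;
}

int main() {
    int garbage[1000];                  // local array instead of malloc
    for (int x = 0; x < 1000; x++) {
        garbage[x] = x;
    }

    while (true) {                      // only the rand() calls survive
        garbage[getRand(1000)] = garbage[getRand(1000)] - garbage[getRand(1000)];
    }
}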
If GCC were even smarter, it could also optimize out the rand calls, because their side effects only affect later rand calls, and in this case there aren't any. However, this sort of optimization would probably be a waste of compiler writers' time.
It can't, in general, tell that rand() doesn't have observable side-effects here, and it isn't required to remove those calls.
It could remove the writes, but it may be that the use of arrays is enough to suppress that.
The standard neither requires nor prohibits what it is doing. As long as the program has the correct observable behaviour any optimisation is purely a quality of implementation matter.
This code causes undefined behaviour because it has an infinite loop with no observable behaviour. Therefore any result is permissible.
In C++14 the text is 1.10/27:
The implementation may assume that any thread will eventually do one of the following:
terminate,
make a call to a library I/O function,
access or modify a volatile object, or
perform a synchronization operation or an atomic operation.
[Note: This is intended to allow compiler transformations such as removal of empty loops, even when termination cannot be proven. —end note ]
I wouldn't say that rand() counts as an I/O function.
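Conversely, if you want a spin loop that cannot legally be removed, one common approach (a sketch, not the only way) is to give it observable behavior through a volatile access:

int main() {
    volatile int sink = 0;

    while (true) {
        sink = 1; // volatile write: observable behavior on every iteration,
                  // so the forward-progress assumption no longer applies
    }
}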
Related question
Leave it a chance to crash by array overflow! The compiler won't speculate on the range of outputs of getRand.
If there were similar questions, please direct me there; I searched quite some time but didn't find anything.
Background:
I was just playing around and found some behavior I can't completely explain...
For primitive types, it looks like an assignment involving an implicit conversion takes longer than one with an explicit cast.
int iTest = 0;
long lMax = std::numeric_limits<long>::max();

for (int i=0; i< 100000; ++i)
{
    // I had 3 such loops, each running 1 of the below lines.
    iTest = lMax;
    iTest = (int)lMax;
    iTest = static_cast<int>(lMax);
}
The result is that the C-style cast and the C++-style static_cast perform the same on average (the numbers differ each run, but with no visible difference), and they both outperform the implicit assignment.
Result:
iTest=-1, lMax=9223372036854775807
(iTest = lMax) used 276 microseconds
iTest=-1, lMax=9223372036854775807
(iTest = (int)lMax) used 191 microseconds
iTest=-1, lMax=9223372036854775807
(iTest = static_cast<int>(lMax)) used 187 microseconds
Question:
Why does the implicit conversion result in higher latency? I can guess that the int overflow has to be detected in the assignment, so the value is adjusted to -1. But what exactly is going on in the assignment?
Thanks!
If you want to know why something is happening under the covers, the best place to look is ... wait for it ... under the covers :-)
That means examining the assembler language that is produced by your compiler.
A C++ environment is best thought of as an abstract machine for running C++ code. The standard (mostly) dictates behaviour rather than implementation details. Once you leave the bounds of the standard and start thinking about what happens underneath, the C++ source code is of little help anymore - you need to examine the actual code that the computer is running, the stuff output by the compiler (usually machine code).
It may be that the compiler is throwing away the loop because it's calculating the same thing every time so only needs do it once. It may be that it throws away the code altogether if it can determine you don't use the result.
There was a time many moons ago, when the VAX Fortran compiler (I did say many moons) outperformed its competitors by several orders of magnitude in a given benchmark.
That was for that exact reason. It had determined the results of the loop weren't used so had optimised the entire loop out of existence.
The other thing you might want to watch out for is the measuring tools themselves. When you're talking about durations of 1/10,000th of a second, your results can be swamped by the slightest bit of noise.
There are ways to alleviate these effects such as ensuring the thing you're measuring is substantial (over ten seconds for example), or using statistical methods to smooth out any noise.
But the bottom line is, it may be the measuring methodology causing the results you're seeing.
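For example, here is a minimal timing sketch along those lines (the loop bound is arbitrary): make the measured work long enough to swamp timer noise, and keep the result observable so the work isn't optimized away.

#include <chrono>
#include <cstdio>

int main()
{
    volatile long result = 0; // volatile: the work cannot be optimized away
    auto start = std::chrono::steady_clock::now();

    for (long i = 0; i < 100000000L; ++i)
        result = result + i;

    auto stop = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::printf("%lld microseconds\n", static_cast<long long>(us));
}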
#include <limits>

int iTest = 0;
long lMax = std::numeric_limits<long>::max();

void foo1()
{
    iTest = lMax;
}

void foo2()
{
    iTest = (int)lMax;
}

void foo3()
{
    iTest = static_cast<int>(lMax);
}
Compiled with GCC 5 using -O3 yields:
__Z4foo1v:
movq _lMax(%rip), %rax
movl %eax, _iTest(%rip)
ret
__Z4foo2v:
movq _lMax(%rip), %rax
movl %eax, _iTest(%rip)
ret
__Z4foo3v:
movq _lMax(%rip), %rax
movl %eax, _iTest(%rip)
ret
They are all exactly the same.
Since you didn't provide a complete example I can only guess that the difference is due to something you aren't showing us.
for(int i = 0; i < my_function(MY_CONSTANT); ++i){
    //code using i
}
In this example, will my_function(MY_CONSTANT) be evaluated at each iteration, or will it be stored automatically? Would this depend on the optimization flags used?
It has to work as if the function is called each time.
However, if the compiler can prove that the function result will be the same each time, it can optimize under the “as if” rule.
E.g. this usually happens with calls to .end() for standard containers.
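A sketch of that case (assuming a plain read-only loop): nothing in the body can invalidate iterators, so the compiler may evaluate v.end() once under the as-if rule.

#include <vector>

int sum( const std::vector<int>& v )
{
    int total = 0;
    for( auto it = v.begin(); it != v.end(); ++it ) // v.end() can be hoisted
    {
        total += *it; // the body never modifies v
    }
    return total;
}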
General advice: when in doubt about whether to micro-optimize a piece of code,
Don't do it.
If you're still thinking of doing it, measure.
Well, there was a third point, but I've forgotten it; maybe it was: still, wait.
In other words, decide whether to use a variable based on how clear the code then is, not on imagined performance.
It will be evaluated each iteration. You can save the extra computation time by doing something like
const int stop = my_function(MY_CONSTANT);
for(int i = 0; i < stop; ++i){
    //code using i
}
A modern optimizing compiler, under the as-if rule, may be able to optimize away the function call in the case that you outlined in your comment here. The as-if rule says that a conforming compiler only has to emulate the observable behavior; we can see this by going to the draft C++ standard section 1.9 Program execution, which says:
[...]Rather, conforming implementations are required to emulate (only)
the observable behavior of the abstract machine as explained below.5
So if you are using a constant expression and my_function does not have observable side effects it could be optimized out. We can put together a simple test (see it live on godbolt):
#include <stdio.h>

#define blah 10

int func( int x )
{
    return x + 20 ;
}

void withConstant( int y )
{
    for(int i = 0; i < func(blah); i++)
    {
        printf("%d ", i ) ;
    }
}

void withoutConstant(int y)
{
    for(int i = 0; i < func(i+y); i++)
    {
        printf("%d ", i ) ;
    }
}
In the case of withConstant we can see it optimizes the computation:
cmpl $30, %ebx #, i
and even in the case of withoutConstant it inlines the calculation instead of performing a function call:
leal 0(%rbp,%rbx), %eax #, D.2605
If my_function is declared constexpr and the argument is really a constant, the value is calculated at compile time, thereby fulfilling the "as-if" and "sequential-consistency with no data-race" rules.
constexpr int my_function(const int c);
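For instance, a minimal sketch of that case (the body is made up): the loop condition folds to a constant at compile time.

constexpr int my_function( const int c )
{
    return 17 + c;
}

void loop()
{
    for( int i = 0; i < my_function(10); ++i ) // condition folds to i < 27
    {
        // ...
    }
}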
If your function has side effects it would prevent the compiler from moving it out of the for-loop as it would not fulfil the "as-if" rule, unless the compiler can reason its way out of it.
The compiler might inline my_function, reduce it as if it were part of the loop, and with constant reduction find out that it's really only a constant, de facto removing the call and replacing it with a constant.
int my_function(const int c) {
    return 17 + c; // inlined and constant-reduced to the value.
}
So the answer to your question is ... maybe!
My question is a very basic one. In C or C++:
Let's say the for loop is as follows,
for(int i=0; i<someArray[a+b]; i++) {
    ....
    do operations;
}
My question is whether the calculation a+b is performed on each iteration of the loop, or computed only once at the beginning of the loop.
For my requirements, the value a+b is constant. If a+b is computed and the value someArray[a+b] is accessed each time through the loop, I would use a temporary variable for someArray[a+b] to get better performance.
You can find out by looking at the generated code:
g++ -S file.cpp
and
g++ -O2 -S file.cpp
Look at the output file.s and compare the two versions. If someArray[a+b] can be reduced to a constant value for all loop cycles, the optimizer will usually do so and pull it out into a temporary variable or register.
It will behave as if it was computed each time. If the compiler is optimising and is capable of proving that the result does not change, it is allowed to move the computation out of the loop. Otherwise, it will be recomputed each time.
If you're certain the result is constant, and speed is important, use a variable to cache it.
is performed on each iteration of the loop or computed only once at the beginning of the loop?
If the compiler is not optimizing this code then it will be computed each time. It is safer to use a temporary variable; it should not cost too much.
First, the C and C++ standards do not specify how an implementation must evaluate i<someArray[a+b], just that the result must be as if it were performed each iteration (provided the program conforms to other language requirements).
Second, any C and C++ implementation of modest quality will have the goal of avoiding repeated evaluation of expressions whose value does not change, unless optimization is disabled.
Third, several things can interfere with that goal, including:
If a, b, or someArray are declared with scope visible outside the function (e.g., are declared at file scope) and the code in the loop calls other functions, the C or C++ implementation may be unable to determine whether a, b, or someArray are altered during the loop.
If the address of a, b, or someArray or its elements is taken, the C or C++ implementation may be unable to determine whether that address is used to alter those objects. This includes the possibility that someArray is an array passed into the function, so its address is known to other entities outside the function.
If a, b, or the elements of someArray are volatile, the C or C++ implementation must assume they can be changed at any time.
Consider this code:
void foo(int *someArray, int *otherArray)
{
    int a = 3, b = 4;
    for(int i = 0; i < someArray[a+b]; i++)
    {
        … various operations …
        otherArray[i] = something;
    }
}
In this code, the C or C++ implementation generally cannot know whether otherArray points to the same array (or an overlapping part) as someArray. Therefore, it must assume that otherArray[i] = something; may change someArray[a+b].
Note that I have answered regarding the larger expression someArray[a+b] rather than just the part you asked about, a+b. If you are only concerned about a+b, then only the factors that affect a and b are relevant, obviously.
Depends on how good the compiler is, what optimization levels you use and how a and b are declared.
For example, if a and/or b has the volatile qualifier, then the compiler has to read it/them every time. In that case, the compiler can't optimize using a cached value of a+b. Otherwise, look at the code generated by the compiler to understand what your compiler does.
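A short sketch of the volatile case (the names are illustrative): the condition must reload a and b on every iteration, so a+b cannot be hoisted out of the loop.

volatile int a = 3, b = 4; // may be changed externally at any time

void spin()
{
    for (int i = 0; i < a + b; ++i)
    {
        // a and b are re-read each time the condition is evaluated
    }
}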
Neither the C nor the C++ standard specifies how this is calculated.
I will bet that if a and b do not change over the loop it is optimized. Moreover, if someArray[a+b] is not touched it is also optimized. This is actually more important, since fetching operations are quite expensive.
That is true with any half-decent compiler with the most basic optimizations. I will also go as far as saying that people who say it is always evaluated are plainly wrong. It is not certain in every case, but it is most probably optimized whenever possible.
The calculation is performed on each iteration of the loop. Although the optimizer can be smart and optimize it out, you would be better off with something like this:
// C++ lets you create a const reference; you cannot do it in C, though
const some_array_type &last(someArray[a+b]);
for(int i=0; i<last; i++) {
    ...
}
It is calculated every time. Use a variable :)
You can compile it and check the assembly code just to make sure.
But I think most compilers are clever enough to optimize this kind of stuff. (If you are using some optimization flag)
It might be calculated every time or it might be optimised. It will depend on whether a and b exist in a scope where the compiler can guarantee that no external function can change their values. That is, if they are in a global context, the compiler cannot guarantee that a function you call in the loop won't modify them (unless you don't call any functions). If they are only in a local context, then the compiler can attempt to optimise that calculation away.
Generating both optimised and unoptimised assembly code is the easiest way to check. However, the best thing to do is not care, because the cost of that sum is so incredibly cheap. Modern processors are very fast, and the thing that is slow is pulling in data from RAM to the cache. If you want to optimise your code, profile it; don't guess.
The calculation a+b would be carried out on every iteration of the loop, and then the lookup into someArray is carried out on every iteration of the loop, so you could probably save a lot of processor time by having a temporary variable set outside the loop, for example (if the array is an array of ints, say):
int myLoopVar = someArray[a+b];
for(int i=0; i<myLoopVar; i++)
{
    ....
    do operations;
}
Very simplified explanation:
If the value at array position a+b were a mere 5, for example, that would be 5 calculations and 5 lookups, so 10 operations, which would be replaced by 8 by using a variable outside the loop (5 accesses, 1 per iteration of the loop; 1 calculation of a+b; 1 lookup; and 1 assignment to the new variable): not so great a saving. If however you are dealing with larger values, for example the value stored in the array at a+b is 100, you would potentially be doing 100 calculations and 100 lookups, versus 103 operations with a variable outside the loop (100 accesses, 1 per iteration of the loop; 1 calculation of a+b; 1 lookup; and 1 assignment to the new variable).
The majority of the above, however, depends on the compiler: depending upon which switches you use, what optimisations the compiler can apply automatically, etc., the code may well be optimised without you having to make any changes to it. The best thing to do is weigh up the pros and cons of each approach for your specific implementation, as what suits a large number of iterations may not be most efficient for a small number, or perhaps memory is an issue, which would dictate a different style for your program... well, you get the idea :)
Let me know if you need any more info :)
for the following code:
int a = 10, b = 10;
for(int i=0; i< (a+b); i++) {} // a and b do not change in the body of loop
you get the following assembly:
L3:
addl $1, 12(%esp) ;increment i
L2:
movl 4(%esp), %eax ;move a to reg AX
movl 8(%esp), %edx ;move b to reg BX
addl %edx, %eax ;AX = AX + BX, i.e. AX = a + b
cmpl 12(%esp), %eax ;compare AX with i
jg L3 ;if AX > i, jump to L3 label
if you apply the compiler optimization, you get the following assembly:
movl $20, %eax ;move 20 (a+b) to AX
L3:
subl $1, %eax ;decrement AX
jne L3 ;jump if condition met
movl $0, %eax ;move 0 to AX
basically, in this case, with my compiler (MinGW 4.8.0), the loop will do "the calculation" regardless of whether you're changing the conditional variables within the loop or not (haven't posted assembly for this, but take my word for it, or even better, don't and disassemble the code yourself).
when you apply the optimization, the compiler will do some magic and churn out a set of instructions that are completely unrecognizable.
if you don't feel like optimizing your loop through a compiler option (-On), then declaring one variable and assigning it a+b will reduce your assembly by an instruction or two.
int a = 10, b = 10;
const int c = a + b;
for(int i=0; i< c; i++) {}
assembly:
L3:
addl $1, 12(%esp)
L2:
movl 12(%esp), %eax
cmpl (%esp), %eax
jl L3
movl $0, %eax
keep in mind, the assembly code I posted here is only the relevant snippet; there's a bit more, but it's not relevant as far as the question goes
Haven't used C++ in a while. I've been depending on my Java compiler to do optimization.
What is the most optimized way to do a for loop in C++? Or is it all the same now with modern compilers? In the 'old days' there was a difference.
for (int i=1; i<=100; i++)
OR
int i;
for (i=1; i<=100; i++)
OR
int i = 1;
for ( ; i<=100; i++)
Is it the same in C?
EDIT:
Okay, so the overwhelming consensus is to use the first case and let the compiler optimize it if it wants to.
I'd say that trivial things like this are probably optimized by the compiler, and you shouldn't worry about them. The first option is the most readable, so you should use that.
EDIT: Adding to what other answers said, there is also the difference that if you declare the variable in the loop initializer, it will cease to exist after the loop ends.
The difference is scope.
for(int i = 1; i <= 100; ++i)
is generally preferable because then the scope of i is restricted to the for loop. If you declare it before the for loop, then it continues to exist after the for loop has finished and could clash with other variables. If you're only using it in the for loop, there's no reason to let it exist longer than that.
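A quick sketch of the scope difference:

for(int i = 1; i <= 100; ++i)
{
    // i exists only here
}
// i is out of scope here; a conforming compiler rejects any use of it

int i = 42; // no clash with the loop variable above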
Let's say the original poster had a loop they really wanted optimized - every instruction counted. How can we figure out - empirically - the answer to his question?
gcc at least has a useful, if uncommonly used switch, '-S'. It dumps the assembly code version of the .c file and can be used to answer questions like the OP poses. I wrote a simple program:
int main( )
{
    int sum = 0;
    for(int i=1;i<=10;++i)
    {
        sum = sum + i;
    }
    return sum;
}
And ran: gcc -O0 -std=c99 -S main.c, creating the assembly version of the main program. Here's the contents of main.s (with some of the fluff removed):
movl $0, -8(%rbp)
movl $1, -4(%rbp)
jmp .L2
.L3:
movl -4(%rbp), %eax
addl %eax, -8(%rbp)
addl $1, -4(%rbp)
.L2:
cmpl $10, -4(%rbp)
jle .L3
You don't need to be an assembly expert to figure out what's going on. movl moves values, addl adds things, cmpl compares, and jle stands for 'jump if less than or equal'; $ is for constants. It's loading 0 into something - that must be 'sum' - and 1 into something else - ah, 'i'! A jump to L2 where we do the compare to 10, jump to L3 to do the add. Fall through to L2 for the compare again. Neat! A for loop.
Change the program to:
int main( )
{
    int sum = 0;
    int i=1;
    for( ;i<=10;++i)
    {
        sum = sum + i;
    }
    return sum;
}
Rerun gcc and the resultant assembly will be very similar. There's some stuff going on with recording line numbers, so they won't be identical, but the assembly ends up being the same. Same result with the last case. So, even without optimization, the code's just about the same.
For fun, rerun gcc with '-O3' instead of '-O0' to enable optimization and look at the .s file.
main:
movl $55, %eax
ret
gcc not only figured out we were doing a for loop, but also realized it was to be run a constant number of times, so it did the loop for us at compile time, chucked out 'i' and 'sum', and hard-coded the answer - 55! That's FAST - albeit a bit contrived.
Moral of the story? Spend your time on ensuring your code is clean and well designed. Code for readability and maintainability. The guys that live on mountain dew and cheetos are way smarter than us and have taken care of most of these simple optimization problems for us. Have fun!
It's the same. The compiler will optimize these to the same thing.
Even if they weren't the same, the difference compared to the actual body of your loop would be negligible. You shouldn't worry about micro-optimizations like this. And you shouldn't make micro-optimizations unless you are performance profiling to see if it actually makes a difference.
It's the same in terms of speed. The compiler will optimize it if you do not have a later use of i.
In terms of style - I'd put the definition in the loop construct, as it reduces the risk that you'll conflict if you define another i later.
Don't worry about micro-optimizations, let the compiler do it. Pick whichever is most readable. Note that declaring a variable within a for initialization statement scopes the variable to the for statement (C++03 § 6.5.3 1), though the exact behavior of compilers may vary (some let you pick). If code outside the loop uses the variable, declare it outside the loop. If the variable is truly local to the loop, declare it in the initializer.
It has already been mentioned that the main difference between the two is scope. Make sure you understand how your compiler handles the scope of an int declared as
for (int i = 1; ...;...)
I know that when using MSVC++6, i is still in scope outside the loop, just as if it were declared before the loop. This behavior differs from VS2005 and, if I recall correctly, from the last version of gcc that I used; in both of those compilers, the variable is only in scope inside the loop.
for(int i = 1; i <= 100; ++i)
This is easiest to read, except for ANSI C / C89 where it is invalid.
A C++ for loop is essentially a packaged while loop.
for (int i=1; i<=100; i++)
{
    some foobar ;
}
To the compiler, the above code is exactly the same as the code below.
{
    int i=1 ;
    while (i<=100){
        some foobar ;
        i++ ;
    }
}
Note the int i=1 ; is contained within a dedicated scope that encloses only it and the while loop.
It's all the same.