Haven't used C++ in a while. I've been depending on my Java compiler to do optimization.
What is the most optimized way to do a for loop in C++? Or is it all the same now with modern compilers? In the 'old days' there was a difference.
for (int i=1; i<=100; i++)
OR
int i;
for (i=1; i<=100; i++)
OR
int i = 1;
for ( ; i<=100; i++)
Is it the same in C?
EDIT:
Okay, so the overwhelming consensus is to use the first form and let the compiler optimize it if it wants to.
I'd say that trivial things like this are probably optimized by the compiler, and you shouldn't worry about them. The first option is the most readable, so you should use that.
EDIT: Adding what other answers said: there is also the difference that if you declare the variable in the loop initializer, it ceases to exist after the loop ends.
The difference is scope.
for(int i = 1; i <= 100; ++i)
is generally preferable because then the scope of i is restricted to the for loop. If you declare it before the for loop, then it continues to exist after the for loop has finished and could clash with other variables. If you're only using it in the for loop, there's no reason to let it exist longer than that.
Let's say the original poster had a loop they really wanted optimized - every instruction counted. How can we figure out - empirically - the answer to his question?
gcc at least has a useful, if uncommonly used switch, '-S'. It dumps the assembly code version of the .c file and can be used to answer questions like the OP poses. I wrote a simple program:
int main( )
{
int sum = 0;
for(int i=1;i<=10;++i)
{
sum = sum + i;
}
return sum;
}
And ran: gcc -O0 -std=c99 -S main.c, creating the assembly version of the main program. Here's the contents of main.s (with some of the fluff removed):
movl $0, -8(%rbp)
movl $1, -4(%rbp)
jmp .L2
.L3:
movl -4(%rbp), %eax
addl %eax, -8(%rbp)
addl $1, -4(%rbp)
.L2:
cmpl $10, -4(%rbp)
jle .L3
You don't need to be an assembly expert to figure out what's going on. movl moves values, addl adds things, cmpl compares, and jle stands for 'jump if less than or equal'; $ is for constants. It's loading 0 into something - that must be 'sum' - and 1 into something else - ah, 'i'! A jump to .L2 where we do the compare to 10, jump to .L3 to do the add. Fall through to .L2 for the compare again. Neat! A for loop.
Change the program to:
int main( )
{
int sum = 0;
int i=1;
for( ;i<=10;++i)
{
sum = sum + i;
}
return sum;
}
Rerun gcc and the resultant assembly will be very similar. There's some stuff going on with recording line numbers, so they won't be identical, but the assembly ends up being the same. Same result with the last case. So, even without optimization, the code's just about the same.
For fun, rerun gcc with '-O3' instead of '-O0' to enable optimization and look at the .s file.
main:
movl $55, %eax
ret
gcc not only figured out we were doing a for loop, but also realized it ran a constant number of times, so it did the loop for us at compile time, chucked out 'i' and 'sum', and hard-coded the answer: 55! That's FAST - albeit a bit contrived.
Moral of the story? Spend your time on ensuring your code is clean and well designed. Code for readability and maintainability. The guys that live on mountain dew and cheetos are way smarter than us and have taken care of most of these simple optimization problems for us. Have fun!
It's the same. The compiler will optimize these to the same thing.
Even if they weren't the same, the difference compared to the actual body of your loop would be negligible. You shouldn't worry about micro-optimizations like this. And you shouldn't make micro-optimizations unless you are performance profiling to see if it actually makes a difference.
It's the same in terms of speed. The compiler will optimize it if you do not have a later use of i.
In terms of style - I'd put the definition in the loop construct, as it reduces the risk that you'll conflict if you define another i later.
Don't worry about micro-optimizations, let the compiler do it. Pick whichever is most readable. Note that declaring a variable within a for initialization statement scopes the variable to the for statement (C++03 § 6.5.3 1), though the exact behavior of compilers may vary (some let you pick). If code outside the loop uses the variable, declare it outside the loop. If the variable is truly local to the loop, declare it in the initializer.
It has already been mentioned that the main difference between the two is scope. Make sure you understand how your compiler handles the scope of an int declared as
for (int i = 1; ...;...)
I know that when using MSVC++ 6, i is still in scope outside the loop, just as if it were declared before the loop. This behavior differs from VS2005, and, if I remember correctly, from the last version of gcc I used: in both of those compilers, the variable is only in scope inside the loop.
for(int i = 1; i <= 100; ++i)
This is easiest to read, except for ANSI C / C89 where it is invalid.
A C++ for loop is essentially a packaged while loop.
for (int i=1; i<=100; i++)
{
some foobar ;
}
To the compiler, the above code is essentially the same as the code below.
{
int i=1 ;
while (i<=100){
some foobar ;
i++ ;
}
}
Note the int i=1 ; is contained within a dedicated scope that encloses only it and the while loop.
It's all the same.
Related
I've been told that when we write int a[100] = {1};, the elements after 1 will be initialized to 0. But I didn't find out how this is done. (And is this true?)
I've tried the code below to roughly find out the time cost:
#include <iostream>
#include <cstring>
#include <ctime>
using namespace std;

#define M 500000
clock_t begin = clock(); // note: initialized before main() runs
int main(){
/*-1-*/ //int a[M] = {1};
/*-2-*/ //int a[M]; memset(a, 0, sizeof(int) * M);
/*-3-*/ //int a[M]; for(int i = 0; i < M; i++) a[i] = 0;
clock_t end = clock();
cout<<(double) (end - begin) / CLOCKS_PER_SEC<<endl;
return 0;
}
Where -1- to -3- are the three cases tested; I enable one of them at a time. On my computer, the average time of the first two cases is 0.04, and the third case costs 0.08 (each tested 10 times). So I guess the initialization is done like memset?
But I'm still confused whether these two are the same. If so, who is doing this?
Please excuse me for my poor English.
Thanks for the comments! And I've read the references and the linked questions.
I've now understood that the compiler does set the remaining elements to 0, but how is this done at a lower level... maybe in assembly? I'm sorry I didn't find references about that.
I'm new to stackoverflow, should I just edit my question like this?
Update:
Thanks a lot! I think I'm clear now.
The standard says that when I write int a[100] = {1};, the rest will be left 0. So my compiler will use some method to implement this, and different compilers may do it in different ways. (Thanks to #eerorika for helping me understand this!)
And thanks to #john for the site godbolt.org, where I can easily read the assembly produced. There I found that with x86-64 gcc 10.2 (in C rather than C++), int a[100] = {1}; behaves like:
leaq -400(%rbp), %rdx
movl $0, %eax
movl $50, %ecx
movq %rdx, %rdi
rep stosq
movl $1, -400(%rbp)
movl $0, %eax
It uses rep stosq to store a zero quad-word 50 times; each quad-word equals two ints, so this zeroes the whole array before a[0] is set to 1.
And thanks to #largest_prime_is_463035818 for pointing out my mistake in measuring time.
Thank you all for making this clear to me!
There is an answer to your question in the C++ standard: the latest draft, 9.4.2 Initializers - Aggregates (dcl.init.aggr), defines which elements are explicitly initialized; elements not explicitly initialized are copy-initialized from an empty initializer list (dcl.init.aggr/5.2).
I've been told that when we write int a[100] = {1};, the elements
after 1 will be initialized to 0.
You were told correctly. With int a[100] = {1};, a[0] is explicitly initialized from the initializer list, the remaining a[1-99] are copy-initialized from an empty list to 0.
But I didn't find out how this is done.
It is done by the compiler.
And is this true?
Yes, this is true.
but how is this done in lower level
It can be done in any way that the compiler designer had chosen.
If you are interested on what your compiler did, you can read the assembly that it produced.
I'm sorry I didn't find references about that.
There is no reference for that.
If there are similar questions please direct me there; I searched for quite some time but didn't find anything.
Background:
I was just playing around and found some behavior I can't completely explain...
For primitive types, it looks as though an implicit conversion in the assignment = takes longer than an explicit cast.
int iTest = 0;
long lMax = std::numeric_limits<long>::max();
for (int i=0; i< 100000; ++i)
{
// I had 3 such loops, each running 1 of the below lines.
iTest = lMax;
iTest = (int)lMax;
iTest = static_cast<int>(lMax);
}
The result is that the C-style cast and the C++-style static_cast perform the same on average (it differs each run, but with no visible difference), and they both outperform the implicit assignment.
Result:
iTest=-1, lMax=9223372036854775807
(iTest = lMax) used 276 microseconds
iTest=-1, lMax=9223372036854775807
(iTest = (int)lMax) used 191 microseconds
iTest=-1, lMax=9223372036854775807
(iTest = static_cast<int>(lMax)) used 187 microseconds
Question:
Why does the implicit conversion result in larger latency? I can guess that the assignment has to detect that the int overflows and adjust it to -1. But what exactly is going on in the assignment?
Thanks!
If you want to know why something is happening under the covers, the best place to look is ... wait for it ... under the covers :-)
That means examining the assembler language that is produced by your compiler.
A C++ environment is best thought of as an abstract machine for running C++ code. The standard (mostly) dictates behaviour rather than implementation details. Once you leave the bounds of the standard and start thinking about what happens underneath, the C++ source code is of little help anymore - you need to examine the actual code that the computer is running, the stuff output by the compiler (usually machine code).
It may be that the compiler is throwing away the loop because it's calculating the same thing every time so only needs do it once. It may be that it throws away the code altogether if it can determine you don't use the result.
There was a time many moons ago, when the VAX Fortran compiler (I did say many moons) outperformed its competitors by several orders of magnitude in a given benchmark.
That was for that exact reason. It had determined the results of the loop weren't used so had optimised the entire loop out of existence.
The other thing you might want to watch out for is the measuring tools themselves. When you're talking about durations of 1/10,000th of a second, your results can be swamped by the slightest bit of noise.
There are ways to alleviate these effects such as ensuring the thing you're measuring is substantial (over ten seconds for example), or using statistical methods to smooth out any noise.
But the bottom line is, it may be the measuring methodology causing the results you're seeing.
#include <limits>
int iTest = 0;
long lMax = std::numeric_limits<long>::max();
void foo1()
{
iTest = lMax;
}
void foo2()
{
iTest = (int)lMax;
}
void foo3()
{
iTest = static_cast<int>(lMax);
}
Compiled with GCC 5 using -O3 yields:
__Z4foo1v:
movq _lMax(%rip), %rax
movl %eax, _iTest(%rip)
ret
__Z4foo2v:
movq _lMax(%rip), %rax
movl %eax, _iTest(%rip)
ret
__Z4foo3v:
movq _lMax(%rip), %rax
movl %eax, _iTest(%rip)
ret
They are all exactly the same.
Since you didn't provide a complete example I can only guess that the difference is due to something you aren't showing us.
My question is a very basic one. In C or C++:
Let's say the for loop is as follows,
for(int i=0; i<someArray[a+b]; i++) {
....
do operations;
}
My question is whether the calculation a+b is performed on each iteration of the loop, or computed only once at the beginning.
For my requirements, the value a+b is constant. If a+b is computed and the value someArray[a+b] is accessed each time through the loop, I would use a temporary variable for someArray[a+b] to get better performance.
You can find out, when you look at the generated code
g++ -S file.cpp
and
g++ -O2 -S file.cpp
Look at the output file.s and compare the two versions. If someArray[a+b] can be reduced to a constant value for all loop cycles, the optimizer will usually do so and pull it out into a temporary variable or register.
It will behave as if it was computed each time. If the compiler is optimising and is capable of proving that the result does not change, it is allowed to move the computation out of the loop. Otherwise, it will be recomputed each time.
If you're certain the result is constant, and speed is important, use a variable to cache it.
is performed for each for loop or it is computed only once at the beginning of the loop?
If the compiler is not optimizing this code then it will be computed each time. It is safer to use a temporary variable; it should not cost too much.
First, the C and C++ standards do not specify how an implementation must evaluate i<someArray[a+b], just that the result must be as if it were performed each iteration (provided the program conforms to other language requirements).
Second, any C and C++ implementation of modest quality will have the goal of avoiding repeated evaluation of expressions whose value does not change, unless optimization is disabled.
Third, several things can interfere with that goal, including:
If a, b, or someArray are declared with scope visible outside the function (e.g., are declared at file scope) and the code in the loop calls other functions, the C or C++ implementation may be unable to determine whether a, b, or someArray are altered during the loop.
If the address of a, b, or someArray or its elements is taken, the C or C++ implementation may be unable to determine whether that address is used to alter those objects. This includes the possibility that someArray is an array passed into the function, so its address is known to other entities outside the function.
If a, b, or the elements of someArray are volatile, the C or C++ implementation must assume they can be changed at any time.
Consider this code:
void foo(int *someArray, int *otherArray)
{
int a = 3, b = 4;
for(int i = 0; i < someArray[a+b]; i++)
{
… various operations …
otherArray[i] = something;
}
}
In this code, the C or C++ implementation generally cannot know whether otherArray points to the same array (or an overlapping part) as someArray. Therefore, it must assume that otherArray[i] = something; may change someArray[a+b].
Note that I have answered regarding the larger expression someArray[a+b] rather than just the part you asked about, a+b. If you are only concerned about a+b, then only the factors that affect a and b are relevant, obviously.
Depends on how good the compiler is, what optimization levels you use and how a and b are declared.
For example, if a and/or b has the volatile qualifier, then the compiler has to read it/them every time. In that case, the compiler can't optimize using the value of a+b. Otherwise, look at the code generated by the compiler to understand what your compiler does.
There's no standard behaviour for how this is calculated in either C or C++.
I will bet that if a and b do not change over the loop, it is optimized. Moreover, if someArray[a+b] is not touched, it is also hoisted. This is actually the more important part, since fetch operations are quite expensive.
That is with any half-decent compiler with the most basic optimizations. I will also go as far as saying that people who claim it is always re-evaluated are plainly wrong. It is not certain either way, but it is most probably optimized whenever possible.
The calculation is performed on each iteration. Although the optimizer can be smart and optimize it out, you would be better off with something like this:
// C++ lets you create a const reference; you cannot do it in C, though
const some_array_type &last(someArray[a+b]);
for(int i=0; i<last; i++) {
...
}
It calculates every time. Use a variable :)
You can compile it and check the assembly code just to make sure.
But I think most compilers are clever enough to optimize this kind of stuff. (If you are using some optimization flag)
It might be calculated every time or it might be optimised. It depends on whether a and b exist in a scope where the compiler can guarantee that no external function changes their values. That is, if they are in a global context, the compiler cannot guarantee that a function you call in the loop won't modify them (unless you call no functions). If they are only in local context, then the compiler can attempt to optimise that calculation away.
Generating both optimised and unoptimised assembly code is the easiest way to check. However, the best thing to do is not care because the cost of that sum is so incredibly cheap. Modern processors are very very fast and the thing that is slow is pulling in data from RAM to the cache. If you want to optimised your code, profile it; don't guess.
The calculation a+b would be carried out every iteration of the loop, and then the lookup into someArray is carried out every iteration of the loop, so you could probably save a lot of processor time by having a temporary variable set outside the loop, for example(if the array is an array of ints say):
int myLoopVar = someArray[a+b];
for(int i=0; i<myLoopVar; i++)
{
....
do operations;
}
Very simplified explanation:
If the value at array position a+b were a mere 5, for example, that would be 5 calculations and 5 lookups, i.e. 10 operations, which would be replaced by 8 using a variable outside the loop (5 accesses (1 per iteration), 1 calculation of a+b, 1 lookup, and 1 assignment to the new variable) - not so great a saving. If, however, you are dealing with larger values - say the value stored in the array at a+b is 100 - you would potentially be doing 100 calculations and 100 lookups, versus 103 operations with a variable outside the loop (100 accesses (1 per iteration), 1 calculation of a+b, 1 lookup, and 1 assignment to the new variable).
The majority of the above, however, is dependent on the compiler: depending on which switches you use, what optimisations the compiler can apply automatically, etc., the code may well be optimised without any changes on your part. The best thing to do is weigh up the pros and cons of each approach for your particular implementation, as what suits a large number of iterations may not be most efficient for a small number, or perhaps memory is an issue dictating a different style . . . well, you get the idea :)
Let me know if you need any more info:)
for the following code:
int a = 10, b = 10;
for(int i=0; i< (a+b); i++) {} // a and b do not change in the body of loop
you get the following assembly:
L3:
addl $1, 12(%esp) ;increment i
L2:
movl 4(%esp), %eax ;move a to reg AX
movl 8(%esp), %edx ;move b to reg DX
addl %edx, %eax ;AX = AX + DX, i.e. AX = a + b
cmpl 12(%esp), %eax ;compare AX with i
jg L3 ;if AX > i, jump to L3 label
if you apply the compiler optimization, you get the following assembly:
movl $20, %eax ;move 20 (a+b) to AX
L3:
subl $1, %eax ;decrement AX
jne L3 ;jump back while AX is not zero
movl $0, %eax ;move 0 to AX
basically, in this case, with my compiler (MinGW 4.8.0), the loop will do "the calculation" regardless of whether you're changing the conditional variables within the loop or not (haven't posted assembly for this, but take my word for it, or even better, don't and disassemble the code yourself).
when you apply the optimization, the compiler will do some magic and churn out a set of instructions that are completely unrecognizable.
if you don't feel like optimizing your loop through a compiler flag (-On), then declaring one variable and assigning it a+b will reduce your assembly by an instruction or two.
int a = 10, b = 10;
const int c = a + b;
for(int i=0; i< c; i++) {}
assembly:
L3:
addl $1, 12(%esp)
L2:
movl 12(%esp), %eax
cmpl (%esp), %eax
jl L3
movl $0, %eax
Keep in mind, the assembly code I posted here is only the relevant snippet; there's a bit more, but it's not relevant as far as the question goes.
I have the following line in a function to count the number of 'G' and 'C' in a sequence:
count += (seq[i] == 'G' || seq[i] == 'C');
Are compilers smart enough to do nothing when they see 'count += 0' or do they actually lose time 'adding' 0 ?
Generally
x += y;
is faster than
if (y != 0) { x += y; }
Even if y == 0, because there is no branch in the first option. If it's really important, you'll have to check the compiler output, but don't assume your way is faster just because it sometimes skips an add.
Honestly, who cares??
[Edit:] This actually turned out somewhat interesting. Contrary to my initial assumption, in unoptimized compilation, n += b; is better than n += b ? 1 : 0;. However, with optimizations, the two are identical. More importantly, though, the optimized version of this form is always better than if (b) ++n;, which always creates a cmp/je instruction. [/Edit]
If you're terribly curious, put the following code through your compiler and compare the resulting assembly! (Be sure to test various optimization settings.)
int n;
void f(bool b) { n += b ? 1 : 0; }
void g(bool b) { if (b) ++n; }
I tested it with GCC 4.6.1: With g++ and with no optimization, g() is shorter. With -O3, however, f() is shorter:
g():
    cmpb $0, 4(%esp)
    je .L1
    addl $1, n
.L1:
    rep

f():
    movzbl 4(%esp), %eax
    addl %eax, n
Note that the optimization for f() actually does what you wrote originally: It literally adds the value of the conditional to n. This is in C++ of course. It'd be interesting to see what the C compiler would do, absent a bool type.
Another note, since you tagged this C as well: In C, if you don't use bools (from <stdbool.h>) but rather ints, then the advantage of one version over the other disappears, since both now have to do some sort of testing.
It depends on your compiler, its optimization options that you used and its optimization heuristics. Also, on some architectures it may be faster to add than to perform a conditional jump to avoid the addition of 0.
Compilers will NOT optimize away the +0 unless the expression on the right is a compile-time constant equal to zero. But adding zero is much faster on all modern processors than branching (if-then) to avoid the add. So the compiler ends up doing the smartest thing available in the situation: simply adding 0.
Some are and some are not smart enough; it is highly dependent on the optimizer implementation.
The optimizer might also determine that a branch is slower than the addition, so it will still do the add.
Consider an example like this:
if (flag)
for (condition)
do_something();
else
for (condition)
do_something_else();
If flag doesn't change inside the for loops, this should be semantically equivalent to:
for (condition)
if (flag)
do_something();
else
do_something_else();
Only in the first case, the code might be much longer (e.g. if several for loops are used or if do_something() is a code block that is mostly identical to do_something_else()), while in the second case, the flag gets checked many times.
I'm curious whether current C++ compilers (most importantly, g++) would be able to optimize the second example to get rid of the repeated tests inside the for loop. If so, under what conditions is this possible?
Yes, if it is determined that flag doesn't change and can't be changed by do_something or do_something_else, it can be pulled outside the loop. I've heard of this called loop hoisting, but Wikipedia has an entry called "loop invariant code motion".
If flags is a local variable, the compiler should be able to do this optimization since it's guaranteed to have no effect on the behavior of the generated code.
If flags is a global variable, and you call functions inside your loop it might not perform the optimization - it may not be able to determine if those functions modify the global.
This can also be affected by the sort of optimization you do - optimizing for size would favor the non-hoisted version while optimizing for speed would probably favor the hoisted version.
In general, this isn't the sort of thing that you should worry about, unless profiling tells you that the function is a hotspot and you see that less than efficient code is actually being generated by going over the assembly the compiler outputs. Micro-optimizations like this you should always just leave to the compiler unless you absolutely have to.
Tried with GCC and -O3:
void foo();
void bar();
int main()
{
bool doesnt_change = true;
for (int i = 0; i != 3; ++i) {
if (doesnt_change) {
foo();
}
else {
bar();
}
}
}
Result for main:
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
call ___main
call __Z3foov
call __Z3foov
call __Z3foov
xorl %eax, %eax
leave
ret
So it does optimize away the choice (and unrolls smaller loops).
This optimization is not done if doesnt_change is global.
I'm sure that if the compiler can determine that the flag will remain constant, it can do some shuffling:
const bool flag = /* ... */;
for (..;..;..;)
{
if (flag)
{
// ...
}
else
{
// ...
}
}
If the flag is not const, the compiler cannot necessarily optimize the loop, because it can't be sure flag won't change. It can if it does static analysis, but not all compilers do, I think. const is the sure-fire way of telling the compiler the flag won't change, after that it's up to the compiler.
As usual, profile and find out if it's really a problem.
I would be wary of saying that it will. Can the compiler guarantee that the value won't be modified by this thread, or by another?
That said, the second version of the code is generally more readable and it would probably be the last thing to optimize in a block of code.
As many have said: it depends.
If you want to be sure, you should try to force a compile-time decision. Templates often come in handy for this:
for (condition)
do_it<flag>();
Generally, yes. But there is no guarantee, and the places where the compiler will do it are probably rare.
What most compilers do without a problem is hoisting immutable evaluations out of the loop, e.g. if your condition is
if (a<b) ....
when a and b are not affected by the loop, the comparison will be made once before the loop.
This means that if the compiler can determine the condition does not change, the test is cheap and the jump well predicted. This in turn means the test itself costs one cycle, or no cycles at all (really).
In which cases splitting the loop would be beneficial?
a) a very tight loop where the 1 cycle is a significant cost
b) the entire loop with both parts does not fit the code cache
Now, the compiler can only make assumptions about the code cache, and usually can order the code in a way that one branch will fit the cache.
Without any testing, I'd expect (a) to be the only case where such an optimization would be applied, because it's not always the better choice:
In which cases splitting the loop would be bad?
When splitting the loop increases code size beyond the code cache, you will take a significant hit. Now, that only affects you if the loop itself is called within another loop, but that's something the compiler usually can't determine.
[edit]
I couldn't get VC9 to split the following loop (one of the few cases where it might actually be beneficial)
extern volatile int vflag = 0;
int foo(int count)
{
int sum = 0;
int flag = vflag;
for(int i=0; i<count; ++i)
{
if (flag)
sum += i;
else
sum -= i;
}
return sum;
}
[edit 2]
note that with int flag = true; the second branch does get optimized away. (and no, const doesn't make a difference here ;))
What does that mean? Either it doesn't support that, it doesn't matter, or my analysis is wrong ;-)
Generally, I'd assume this is an optimization that is valuable only in very few cases, and one that can be done by hand easily in most scenarios.
It's called a loop invariant and the optimization is called loop invariant code motion and also code hoisting. The fact that it's in a conditional will definitely make the code analysis more complex and the compiler may or may not invert the loop and the conditional depending on how clever the optimizer is.
There is a general answer for any specific case of this kind of question, and that's to compile your program and look at the generated code.