I have found that this code causes a startling error in the gnu C++ compiler when it is optimizing.
#include <stdio.h>
int main()
{
int a = 333666999, b = 0;
for (short i = 0; i<7; ++i)
{
b += a;
printf("%d ", b);
}
return 9;
}
When I compile it with g++ -Os fail.cpp, the executable does not print seven numbers; it goes on printing forever. I am using:
-rwxr-xr-x 4 root root 700388 Jun 3 2013 /usr/bin/g++
Is there a later corrected version?
The compiler is very, very rarely wrong. In this case, b is overflowing, which is undefined behaviour for signed integers:
$ g++ --version
g++ (GCC) 10.2.0
...
$ g++ -Os -otest test.cpp
test.cpp: In function ‘int main()’:
test.cpp:8:11: warning: iteration 6 invokes undefined behavior [-Waggressive-loop-optimizations]
8 | b += a;
| ~~^~~~
test.cpp:6:24: note: within this loop
6 | for (short i = 0; i<7; ++i)
| ~^~
And if you invoke undefined behaviour, the compiler is free to do whatever it likes, including making your program never terminate.
Edit: Some people seem to think that the UB should only affect the value of b, but not the loop iteration. This is not according to the Standard (UB can cause literally anything to happen) but it's a reasonable thought, so let's look at the generated assembly to see why the loop doesn't terminate.
First without -Os:
.LC0:
.string "%d "
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-12], 333666999
mov DWORD PTR [rbp-4], 0
mov WORD PTR [rbp-6], 0
.L3:
cmp WORD PTR [rbp-6], 6 # Compare i to 6
jg .L2 # If greater, jump to end
mov eax, DWORD PTR [rbp-12]
add DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
movzx eax, WORD PTR [rbp-6]
add eax, 1
mov WORD PTR [rbp-6], ax
jmp .L3
.L2:
mov eax, 9
leave
ret
Then with -Os:
.LC0:
.string "%d "
main:
push rbx
xor ebx, ebx
.L2:
add ebx, 333666999
mov edi, OFFSET FLAT:.LC0
xor eax, eax
mov esi, ebx
call printf
jmp .L2
The comparison and jump instructions are completely gone. Ironically, the compiler did exactly what you asked it to do: optimize for size, so remove as many instructions as it can while obeying the C++ standard. -O3 and -O2 generate the exact same code as -Os here.
-O1 generates a very interesting output:
.LC0:
.string "%d "
main:
push rbx
mov ebx, 0
.L2:
add ebx, 333666999
mov esi, ebx
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
cmp ebx, -1959298303
jne .L2
mov eax, 9
pop rbx
ret
Here, the compiler optimized away the loop counter i and just compares the value of b to its final value after 7 iterations, using the fact that signed overflow happens according to two's complement on this platform! Cheeky, isn't it? :)
I am using g++ version 4.8.1. Thomas has version 10.2.0, which evidently emits a warning about "undefined behavior" when adding two signed integers. However, since it is only a warning, the compiler still goes ahead and compiles the program. In all circumstances, though, the "undefined behavior" should only concern the integers being added. In practice those integers do in fact produce the expected two's complement result. The "undefined behavior" should not overwrite other variables in the program. Otherwise the executable cannot be trusted at all, and if it cannot be trusted it shouldn't be compiled. Perhaps there is an even later version of the GNU compiler that works correctly when optimizing?
Related
#include <cstdio>
__int128 idx;
int main() {
int a[2] = {1, 2};
idx++;
a[idx] = 0;
printf("%d %d", a[0], a[1]);
}
After turning on -O2, a[idx] = 0 is not executed.
I guess it shouldn't be undefined behavior.
Is this a bug in the compiler?
https://godbolt.org/z/qqccd9oEj
Looking at the compiler output for gcc-12.1 -std=c++20 -O2 -W -Wall
.LC0:
.string "%d %d"
main:
sub rsp, 8
mov edx, 2
add QWORD PTR idx[rip], 1
mov esi, 1
adc QWORD PTR idx[rip+8], 0
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
idx:
.zero 16
The problem is mov edx, 2. That is just wrong: the compiler should read a[1] and optimize that to 0, not 2.
clang gets it right but still generates horrible code. idx should get optimized out.
You should file that as a compiler bug.
I wrote this program as a test case for the behavior of bit field member comparisons in C++ (I suppose the same behavior would be exhibited in C as well):
#include <cstdint>
#include <cstdio>
union Foo
{
int8_t bar;
struct
{
#if __BYTE_ORDER == __LITTLE_ENDIAN
int8_t baz : 1;
int8_t quux : 7;
#elif __BYTE_ORDER == __BIG_ENDIAN
int8_t quux : 7;
int8_t baz : 1;
#endif
};
};
int main()
{
Foo foo;
scanf("%d", &foo.bar);
if (foo.baz == 1)
printf("foo.baz == 1\n");
else
printf("foo.baz != 1\n");
}
After I compile and run it with 1 as its input, I get the following output:
foo.baz != 1
*** stack smashing detected ***: terminated
fish: “./a.out” terminated by signal SIGABRT (Abort)
One would expect that the foo.baz == 1 check would be evaluated as true since baz is always the least significant bit in the anonymous bit field. However, the opposite seems to happen, as can be seen from the program output (which is, somewhat comfortingly, consistently the same across each program invocation).
Even more weird to me is the fact that the generated AMD64 assembly code for the program (using the GCC 10.2 compiler) does not contain even a single comparison or jump instruction!
.LC0:
.string "%d"
.LC1:
.string "foo.baz != 1"
main:
push rbp
mov rbp, rsp
sub rsp, 16
lea rax, [rbp-1]
mov rsi, rax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call scanf
mov edi, OFFSET FLAT:.LC1
call puts
mov eax, 0
leave
ret
It seems that the C++ code for the if statement somehow gets optimized out (or something like that), even though I compiled the program with the default settings (i.e. I did not turn on any level of optimization or anything like that).
Interestingly enough, Clang 10.0.1 (when run without optimizations) seems to generate code with a cmp instruction (as well as a jne and a jmp one):
main: # @main
push rbp
mov rbp, rsp
sub rsp, 16
mov dword ptr [rbp - 4], 0
lea rax, [rbp - 8]
movabs rdi, offset .L.str
mov rsi, rax
mov al, 0
call scanf
mov cl, byte ptr [rbp - 8]
shl cl, 7
sar cl, 7
movsx edx, cl
cmp edx, 1
jne .LBB0_2
movabs rdi, offset .L.str.1
mov al, 0
call printf
jmp .LBB0_3
.LBB0_2:
movabs rdi, offset .L.str.2
mov al, 0
call printf
.LBB0_3:
mov eax, dword ptr [rbp - 4]
add rsp, 16
pop rbp
ret
.L.str:
.asciz "%d"
.L.str.1:
.asciz "foo.baz == 1\n"
.L.str.2:
.asciz "foo.baz != 1\n"
Both of the printf strings also seem to be present in the data segment (unlike in the GCC case when only the second one is present). I cannot tell for sure (because I'm not very proficient in assembly) but this seems to be properly generated code (unlike the one which GCC generates).
However, as soon as I try to compile with any kind of optimization (even -O1) using Clang, the comparisons/jumps are gone (as well as the foo.baz == 1 string), and the generated code seems to be very similar to the one which GCC generates:
(with -O1)
main: # @main
push rax
mov rsi, rsp
mov edi, offset .L.str
xor eax, eax
call scanf
mov edi, offset .Lstr
call puts
xor eax, eax
pop rcx
ret
.L.str:
.asciz "%d"
.Lstr:
.asciz "foo.baz != 1"
(You may want to check the generated assembly code by different compiler versions yourself using Compiler Explorer.)
I'm totally perplexed by this kind of unintuitive behavior. The only thing which comes to mind as an explanation is the interaction of some weird undefined behavior of bitfields containing signed integral types and unions. What makes me think so is that after I replace the signed integer types with their unsigned counterparts, the output of the program becomes exactly as one would expect (with 1 as input):
foo.baz == 1
*** stack smashing detected ***: terminated
fish: “./a.out” terminated by signal SIGABRT (Abort)
Naturally, the program crashing because of a stack smashing (just like before) is something which is not supposed to happen, which leads to my second question: why does this occur?
Here's the modified program:
#include <cstdint>
#include <cstdio>
union Foo
{
uint8_t bar;
struct
{
#if __BYTE_ORDER == __LITTLE_ENDIAN
uint8_t baz : 1;
uint8_t quux : 7;
#elif __BYTE_ORDER == __BIG_ENDIAN
uint8_t quux : 7;
uint8_t baz : 1;
#endif
};
};
int main()
{
Foo foo;
scanf("%d", &foo.bar);
if (foo.baz == 1)
printf("foo.baz == 1\n");
else
printf("foo.baz != 1\n");
}
... and the generated assembly code by GCC:
.LC0:
.string "%d"
.LC1:
.string "foo.baz == 1"
.LC2:
.string "foo.baz != 1"
main:
push rbp
mov rbp, rsp
sub rsp, 16
lea rax, [rbp-1]
mov rsi, rax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call scanf
movzx eax, BYTE PTR [rbp-1]
and eax, 1
test al, al
je .L2
mov edi, OFFSET FLAT:.LC1
call puts
jmp .L3
.L2:
mov edi, OFFSET FLAT:.LC2
call puts
.L3:
mov eax, 0
leave
ret
The stack smashing has nothing to do with member access.
scanf("%d", &foo.bar);
The %d format conversion specifier is for an int. Which is, typically, 4 bytes. But your bar is:
int8_t bar;
just one byte.
So, scanf ends up writing 4 bytes' worth of an int value into a one-byte bar, clobbering three additional bytes in the immediate vicinity.
There's your stack smash.
The answer is trivial.
Your baz struct member is 1 bit long and it is signed, so it will never be 1. The only possible values are 0 and -1.
The compiler knows that, so the condition foo.baz == 1 will never be true. No conditional code has to be generated.
So I'm afraid it is not a compiler bug, only a programmer bug :)
So if we change the code to:
int main()
{
union Foo foo;
int x;
scanf("%d", &x);
foo.bar = x;
if (foo.baz == -1)
printf("foo.baz == -1\n");
else
printf("foo.baz != -1\n");
}
The compiler starts to generate the conditional instructions.
https://godbolt.org/z/fzKMo5
BTW, your endianness check does not make any sense here, as endianness defines the byte order, not the bit order.
Unrelated to the code generation problem, the program also uses the wrong scanf conversion specifier.
I'm trying to call an x64 assembly function with four parameters from C++ code, and the assembly function resets the first parameter to zero every time. Please find the code snippet below.
C++ code: test.cpp
#include <iostream>
extern "C" int IntegerShift_(unsigned int a, unsigned int* a_shl, unsigned int* a_shr, unsigned int count);
int main(int argc, char const *argv[])
{
unsigned int a = 3119, count = 6, a_shl, a_shr;
std::cout << "a value before calling " << a << std::endl;
IntegerShift_(a, &a_shl, &a_shr, count);
std::cout << "a value after calling " << a << std::endl;
return 0;
}
x64 assembly code: test.asm
section .data
section .bss
section .text
global IntegerShift_
IntegerShift_:
;prologue
push rbp
mov rbp, rsp
mov rax, rdi
shl rax, cl
mov [rsi], rax
mov rax, rdi
shr rax, cl
mov [rdx], rax
xor rax,rax
;epilogue
mov rbp, rsp
pop rbp
ret
I'm working on the below environment.
OS - Ubuntu 18.04 64-bit
Assembler - nasm (2.13.02)
C++ compiler - g++ (7.4.0)
processor - Intel® Pentium(R) CPU G3240 @ 3.10GHz × 2
and I'm compiling my code as below
$ nasm -f elf64 -g -F dwarf test.asm
$ g++ -g -o test test.cpp test.o
$ ./test
$ a value before calling 3119
$ a value after calling 0
But if I comment out the line mov [rdx], rax in the assembly function, it no longer resets the value of variable a. I'm new to x64 assembly programming and I couldn't find the relation between the rdx register and variable a.
unsigned int* a_shl, unsigned int* a_shr are pointers to unsigned int, a 32-bit (dword) type.
You do two qword stores, mov [rsi], rax and mov [rdx], rax which store outside of the pointed-to objects.
The C equivalent would be a function that takes unsigned int* args and does
*(unsigned long*)a_shr = a>>count;. This is of course UB, and behaviour like this (overwriting other variables) is pretty much what you'd expect.
Presumably you compiled with optimization disabled so the caller actually reloaded a from the stack. And it put a_shr or a_shl next to a in its stack frame, and one of your stores zeroed your caller's copy of a.
(As usual, gcc happened to zero the upper 32 bits of RDI while it put a into EDI as the first arg. Writing a 32-bit register zero-extends to the full register. So your other bug; right shifting high garbage into the low 32 bits for a_shr, didn't bite you with this caller.)
Simpler implementation:
global IntegerShift ; why the trailing underscore? That's weird for no reason.
IntegerShift:
;prologue not needed, we don't even use the stack
; so don't waste instructions making a frame pointer.
mov eax, edi
shl eax, cl ; a<<count
mov [rsi], eax ; 32-bit store
;mov rax, rdi ; we can just destroy our local a, we're done with it
shr edi, cl ; a>>count
mov [rdx], edi ; 32-bit store
xor eax, eax ; return 0
ret
xor eax, eax is the most efficient way to zero a 64-bit register (no wasted REX prefix). And your return value is only 32-bit anyway because you declared it int, so it makes no sense to be using 64-bit registers.
BTW, if you had BMI2 available (which you don't on your budget Pentium CPU, unfortunately), you could avoid all the register copying, and be more efficient on Intel CPUs (SHLX/SHRX is only 1 uop instead of 3 for shl/shr reg, cl, because of the legacy x86 FLAGS-unmodified semantics for the cl=0 case):
shlx eax, edi, ecx
shrx edi, edi, ecx
mov [rsi], eax
mov [rdx], edi
xor eax, eax
ret
Firstly: This code is considered to be of pure fun, please do not do anything like this in production. We will not be responsible of any harm caused to you, your company or your reindeer after compiling and executing this piece of code in any environment. The code below is not safe, not portable and is plainly dangerous. Be warned. Long post below. You were warned.
Now, after the disclaimer: Let's consider the following piece of code:
#include <stdio.h>
int fun()
{
return 5;
}
typedef int(*F)(void) ;
int main(int argc, char const *argv[])
{
void *ptr = &&hi;
F f = (F)ptr;
int c = f();
printf("TT: %d\n", c);
if(c == 5) goto bye;
//else goto bye; /* <---- This is the most important line. Pay attention to it */
hi:
c = 5;
asm volatile ("movl $5, %eax");
asm volatile ("retq");
bye:
return 66;
}
To begin with, we have the function fun, which I have created purely for reference, to get the generated assembly code.
Then we declare a function pointer type F for functions taking no parameters and returning an int.
Then we use the not so well known GCC extension https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html to get the address of a label hi, and this works in clang too. Then we do something evil: we create a function pointer of type F called f and initialize it to be the label above.
Then, worst of all, we actually call this function and assign its return value to a local variable called c, and then we print it out.
The following is an if to check if the value assigned to c is actually the one we need, and if yes go to bye so that the application exits normally, with exit code 66. If that can be considered a normal exit code.
The next line is commented out, but I can say this is the most important line in the entire application.
The piece of code after the label hi is to assign 5 to the value of c, then two lines of assembly to initialize the value of eax to 5 and to actually return from the "function" call. As mentioned, there is a reference function, fun which generates the same code.
And now we compile this application, and run it on our online platform: https://gcc.godbolt.org/z/K6z5Yc
It generates the following assembly (with -O1 turned on; -O0 gives a similar result, albeit a bit longer):
# else goto bye is COMMENTED OUT
fun:
mov eax, 5
ret
.LC0:
.string "TT: %d\n"
main:
push rbx
mov eax, OFFSET FLAT:.L3
call rax
mov ebx, eax
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
cmp ebx, 5
je .L4
.L3:
movl $5, %eax
retq
.L4:
mov eax, 66
pop rbx
ret
The important lines are mov eax, OFFSET FLAT:.L3 where the L3 corresponds to our hi label, and the line after that: call rax which actually calls it.
And runs like:
ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 66
TT: 5
Now, let's revisit the most important line in the application and uncomment it.
With -O0 we get the following assembly, generated by gcc:
# else goto bye is UNCOMMENTED
# even gcc -O0 "knows" hi: is unreachable.
fun:
push rbp
mov rbp, rsp
mov eax, 5
pop rbp
ret
.LC0:
.string "TT: %d\n"
main:
push rbp
mov rbp, rsp
sub rsp, 48
mov DWORD PTR [rbp-36], edi
mov QWORD PTR [rbp-48], rsi
mov QWORD PTR [rbp-8], OFFSET FLAT:.L4
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rbp-16], rax
mov rax, QWORD PTR [rbp-16]
call rax
mov DWORD PTR [rbp-20], eax
mov eax, DWORD PTR [rbp-20]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
cmp DWORD PTR [rbp-20], 5
nop
.L4:
mov eax, 66
leave
ret
and the following output:
ASM generation compiler returned: 0
Execution build compiler returned: 0
Program returned: 66
So, as you can see, our printf was never called. The culprit is the line mov QWORD PTR [rbp-8], OFFSET FLAT:.L4, where L4 actually corresponds to our bye label.
And from what I can see from the generated assembly, not a piece of code from the part after hi was added into the generated code.
But at least the application runs and at least has some code for comparing c to 5.
On the other hand, clang with -O0 generates the following nightmare, which by the way crashes:
# else goto bye is UNCOMMENTED
# clang -O0 also doesn't emit any instructions for the hi: block
fun: # @fun
push rbp
mov rbp, rsp
mov eax, 5
pop rbp
ret
main: # @main
push rbp
mov rbp, rsp
sub rsp, 48
mov dword ptr [rbp - 4], 0
mov dword ptr [rbp - 8], edi
mov qword ptr [rbp - 16], rsi
mov qword ptr [rbp - 24], 1
mov rax, qword ptr [rbp - 24]
mov qword ptr [rbp - 32], rax
call qword ptr [rbp - 32]
mov dword ptr [rbp - 36], eax
mov esi, dword ptr [rbp - 36]
movabs rdi, offset .L.str
mov al, 0
call printf
cmp dword ptr [rbp - 36], 5
jne .LBB1_2
jmp .LBB1_3
.LBB1_2:
jmp .LBB1_3
.LBB1_3:
mov eax, 66
add rsp, 48
pop rbp
ret
.L.str:
.asciz "TT: %d\n"
If we turn on some optimization, for example O1, we get from gcc:
# else goto bye is UNCOMMENTED
# gcc -O1
fun:
mov eax, 5
ret
.LC0:
.string "TT: %d\n"
main:
sub rsp, 8
mov eax, OFFSET FLAT:.L3
call rax
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
.L3:
mov eax, 66
add rsp, 8
ret
and the application crashes, which is sort of understandable. Again, the compiler has entirely removed our hi section (mov eax, OFFSET FLAT:.L3 tiptoes to L3, which corresponds to our bye section) and unfortunately decided that it's a good idea to increase rsp before the ret, so that we are sure to end up somewhere totally different from where we need to be.
And clang delivers something even more dubious:
# else goto bye is UNCOMMENTED
# clang -O1
fun: # @fun
mov eax, 5
ret
main: # @main
push rax
mov eax, 1
call rax
mov edi, offset .L.str
mov esi, eax
xor eax, eax
call printf
mov eax, 66
pop rcx
ret
.L.str:
.asciz "TT: %d\n"
1 ? How on earth did clang end up with this?
To some level I understand that the compiler decided that dead code after an if where both if and else go to the same location is not needed, but here my knowledge and insight stops.
So now, dear C and C++ gurus, assembly aficionados and compiler crushers, here comes the question:
Why?
Why do you think did the compiler decide that the two labels should be considered equivalent if we have added the else branch, or why did clang put there 1, and last but not least: someone with a deep understanding of the C standard could maybe point out where this piece of code deviated so badly from normality that we ended up in this really really weird situation.
someone with a deep understanding of the C standard could maybe point out where this piece of code deviated so badly from normality that we ended up in this really really weird situation.
You think the ISO C standard has anything to say about this code? It's chock full of UB and GNU extensions, notably pointers to local labels.
Casting a label pointer to a function pointer and calling through it is obviously UB. The GCC manual doesn't say you can do that. It's also UB to goto a label in another function.
You were only able to make that work by tricking the compiler into thinking that block might be reached so it's not removed, then using GNU C Basic asm statements to emit a ret instruction there.
GCC and clang remove dead code even with optimization disabled; e.g. if(0) { ... } doesn't emit any instructions to implement the ...
Also note that the c=5 in hi: compiles with optimization fully disabled (and else goto bye commented) to asm like movl $5, -20(%rbp). i.e. using the caller's RBP to modify local variables in the stack frame of the caller. So you have a nested function.
GNU C allows you to define nested functions that can access the local vars of their parent scope. (If you liked the asm you got from your experiment, you'll love the executable trampoline of machine-code that GCC stores to the stack with mov-immediate if you take a pointer to a nested function!)
asm volatile ("movl $5, %eax"); is missing a clobber on EAX. You step on the compiler's toes which would be UB if this statement was ever reached normally, rather than as if it were a separate function.
The use-case for GNU C Basic asm (no constraints / clobbers) is instructions like cli (disable interrupts), not anything involving integer registers, and definitely not ret.
If you want to define a callable function using inline asm, you can use asm("") at global scope, or as the body of an __attribute__((naked)) function.
I'm learning about performance in C++ (and C++11). I need good performance in both Debug and Release mode, because I spend time both debugging and executing.
I'm surprised by these two tests and by how much the results change with different compiler optimization flags.
Test iterator 1:
Optimization 0 (-O0): faster.
Optimization 3 (-O3): slower.
Test iterator 2:
Optimization 0 (-O0): slower.
Optimization 3 (-O3): faster.
P.S.: I use the following clock code.
Test iterator 1:
void test_iterator_1()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
size_t count = v.size();
for (unsigned int i = 0; i < count; ++i) {
v[i] = 1;
}
}
Test iterator 2:
void test_iterator_2()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
for (int& i : v) {
i = 1;
}
}
UPDATE: The problem is still the same, but for the ranged for at -O3 the difference is small. So for loop 1 is the best.
UPDATE 2: Results:
With -O3:
t1: 80 units
t2: 74 units
With -O0:
t1: 287 units
t2: 538 units
UPDATE 3: The CODE!. Compile with: g++ -std=c++11 test.cpp -O0 (and then -O3)
Your first test is actually setting the value of each element in the vector to 1.
Your second test is setting the value of a copy of each element in the vector to 1 (the original vector is the same).
When you optimize, the second loop more than likely is removed entirely as it is basically doing nothing.
If you want the second loop to actually set the value:
for (int& i : v) // notice the &
{
i = 1;
}
Once you make that change, your loops are likely to produce assembly code that is almost identical.
As a side note, if you wanted to initialize the entire vector to a single value, the better way to do it is:
std::vector<int> v(SIZE, 1);
EDIT
The assembly is fairly long (100+ lines), so I won't post it all, but a couple things to note:
Version 1 will store a value for count and increment i, testing it against count each time. Version 2 uses iterators (basically the same as std::for_each(v.begin(), v.end() ...)). So the code for the loop maintenance is very different (there is more setup for version 2, but less work each iteration).
Version 1 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
push eax
lea ecx, DWORD PTR _v$[ebp]
call ??A?$vector@HV?$allocator@H@std@@@std@@QAEAAHI@Z ; std::vector<int,std::allocator<int> >::operator[]
mov DWORD PTR [eax], 1
Version 2 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
mov DWORD PTR [eax], 1
When they get optimized, this all changes and (other than the ordering of a few instructions), the output is almost identical.
Version 1 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize@?$vector@HV?$allocator@H@std@@@std@@QAEXI@Z ; std::vector<int,std::allocator<int> >::resize
mov ecx, DWORD PTR _v$[ebp+4]
mov edx, DWORD PTR _v$[ebp]
sub ecx, edx
sar ecx, 2 ; this is the only differing instruction
test ecx, ecx
je SHORT $LN3@test_itera
push edi
mov eax, 1
mov edi, edx
rep stosd
pop edi
$LN3@test_itera:
test edx, edx
je SHORT $LN21@test_itera
push edx
call DWORD PTR __imp_??3@YAXPAX@Z
add esp, 4
$LN21@test_itera:
mov esp, ebp
pop ebp
ret 0
Version 2 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize@?$vector@HV?$allocator@H@std@@@std@@QAEXI@Z ; std::vector<int,std::allocator<int> >::resize
mov edx, DWORD PTR _v$[ebp]
mov ecx, DWORD PTR _v$[ebp+4]
mov eax, edx
cmp edx, ecx
je SHORT $LN1@test_itera
$LL33@test_itera:
mov DWORD PTR [eax], 1
add eax, 4
cmp eax, ecx
jne SHORT $LL33@test_itera
$LN1@test_itera:
test edx, edx
je SHORT $LN47@test_itera
push edx
call DWORD PTR __imp_??3@YAXPAX@Z
add esp, 4
$LN47@test_itera:
mov esp, ebp
pop ebp
ret 0
Do not worry about how much time each operation takes; that falls squarely under the "premature optimization is the root of all evil" quote by Donald Knuth. Write simple, easy-to-understand programs: your time while writing the program (and reading it next week to tweak it, or to find out why the &%$# it is giving crazy results) is much more valuable than any computer time wasted. Just compare your weekly income to the price of an off-the-shelf machine, and think how much of your time is required to shave off a few minutes of compute time.
Do worry when you have measurements showing that the performance isn't adequate. Then you must measure where your runtime (or memory, or whatever else resource is critical) is spent, and see how to make that better. The (sadly out of print) book "Writing Efficient Programs" by Jon Bentley (much of it also appears in his "Programming Pearls") is an eye-opener, and a must read for any budding programmer.
Optimization is pattern matching: The compiler has a number of different situations it can recognize and optimize. If you change the code in a way that makes the pattern unrecognizable to the compiler, suddenly the effect of your optimization vanishes.
So, what you are witnessing is nothing more or less than that the ranged for loop produces more bloated code without optimization, but that in this form the optimizer is able to recognize a pattern that it cannot recognize for the iterator-free case.
In any case, if you are curious, you should take a look at the produced assembler code (compile with -S option).