g++ generates an assembly code without push and pop

g++ generates an assembly code without push and pop - c++

I figured out that the g++ compiler generates an assembly code hardly without any push/pop instructions. It only uses those when getting in/out a function. Everytime it emplaces bytes in the stack, it makes 2 or 3 instruction, for example:
movl foo, %eax
subl $4, %esp
movl %eax, (%esp)`
Insted of just pushl foo. Is there a reason for that? Is it faster or something?
Thank you.

Related

do constant calculations in #define consume resources?

as far as I'm concerned, constants and definitions in c/c++ do not consume memory and other resources, but that is a different story when we use definitions as macros and put some calculations inside it. take a look at the code:
#include "math.h"
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
When we use the variable 'd' in our code, does it spend time for CPU to calculate the SQRT and add operations or this values are calculated in compiler and just replaced in the code?
If it is calculated by CPU, is there anyway to be calculated beforehand in preprocessor and assigned as a constant here?

as far as I'm concerned, constants and definitions in c/c++ do not consume memory and other resources,
It depends on what you mean by that. Constants appearing in expressions that potentially are evaluated at runtime have to be represented in the program somehow. Under most circumstances, that will take some space.
Even (in C++) a constexpr can be evaluated at runtime, even though implementations can and probably do evaluate them at compile time.
but that is a different story when we use definitions as macros and put some calculations inside it.
Yes, it is different, because macros are not constants in any applicable sense of that term.
take a look at the code:
#include "math.h"
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
When we use the variable 'd' in our code,
d is not a variable. It is a macro. Wherever it appears within the scope of those macro definitions, it is exactly equivalent to the expression sqrt(12.2*5.8) appearing at that point.
does it spend time for CPU
to calculate the SQRT and add operations or this values are calculated
in compiler and just replaced in the code?
Either could happen. It depends on your compiler, probably on compilation options, and possibly on other code in the translation unit. Pre-computation is more likely at higher optimization levels.
If it is calculated by CPU,
is there anyway to be calculated beforehand in preprocessor and
assigned as a constant here?
Such a calculation is not part of the semantics of the preprocessor per se. To the somewhat artificial extent that we draw a distinction between preprocessor and compiler in modern C and C++ implementations, if a precomputation is performed then it will be performed by the compiler, not the preprocessor.
The C and C++ languages do not define a mechanism to force such evaluations to be performed at compile time, but you can make it more likely by increasing the compiler's optimization level. Or in C++, there is probably a way to use a template to wrap the expression in a constexpr function computing its value, which would make it very likely to be computed at compile time.
But if a constant is what you want, then you always have the option of pre-computing it manually, and defining the macro to expand to an actual constant.

The C standard doesn't specify whether the calculations required when using d (i.e. sqrt(12.2 * 5.8) is done at compile time or at run time. It's left to the individual compiler to decide.
Example:
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
float foo() {
return d;
}
int main(void)
{
printf("%f\n", foo());
}
may result in (using https://godbolt.org/ and gcc 10.2 -O0)
foo:
pushq %rbp
movq %rsp, %rbp
movss .LC0(%rip), %xmm0
popq %rbp
ret
.LC1:
.string "%f\n"
main:
pushq %rbp
movq %rsp, %rbp
movl $0, %eax
call foo
pxor %xmm1, %xmm1
cvtss2sd %xmm0, %xmm1
movq %xmm1, %rax
movq %rax, %xmm0
movl $.LC1, %edi
movl $1, %eax
call printf
movl $0, %eax
popq %rbp
ret
.LC0:
.long 1090950945
and for -O2
foo:
movss .LC0(%rip), %xmm0
ret
.LC2:
.string "%f\n"
main:
subq $8, %rsp
movl $.LC2, %edi
movl $1, %eax
movsd .LC1(%rip), %xmm0
call printf
xorl %eax, %eax
addq $8, %rsp
ret
.LC0:
.long 1090950945
.LC1:
.long 536870912
.long 1075892964
so in this example we see a compile time calculation. In my experience all the major compilers will do that but it's not a requirement made by the C standard.

To understand this, you need to understand how #define works! #define is not a statement which the compiler even receives. It is a preprocessor directive, which means, it is processed by the preprocessor. #define a b literally replaces all occurrences of a with b. So, as many times as you call sqrt(c), that many times, it will be replaced by sqrt(a*b). Now, the standards do not mention whether something like this will be calculated at runtime or at compile time, and it has been left to the individual compiler to decide. Although, usage of optimisation flags will definitely impact the end result. Not to mention, neither of those variables will be type safe!

Remove needless assembler statements from g++ output

I am investigating some problem with a local binary. I've noticed that g++ creates a lot of ASM output that seems unnecessary to me. Example with -O0:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp <--- just need 8 bytes for the movq to -8(%rbp), why -16?
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi <--- now we have moved rdi onto itself.
call Base::Base()
leaq 16+vtable for Derived(%rip), %rdx
movq -8(%rbp), %rax <--- effectively %edi, does not point into this area of the stack
movq %rdx, (%rax) <--- thus this wont change -8(%rbp)
movq -8(%rbp), %rax <--- so this statement is unnecessary
movl $4712, 12(%rax)
nop
leave
ret
option -O1 -fno-inline -fno-elide-constructors -fno-omit-frame-pointer:
Derived::Derived():
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $8, %rsp <--- reserve some stack space and never use it.
movq %rdi, %rbx
call Base::Base()
leaq 16+vtable for Derived(%rip), %rax
movq %rax, (%rbx)
movl $4712, 12(%rbx)
addq $8, %rsp <--- release unused stack space.
popq %rbx
popq %rbp
ret
This code is for the constructor of Derived that calls the Base base constructor and then overrides the vtable pointer at position 0 and sets a constant value to an int member it holds in addition to what Base contains.
Question:
Can I translate my program with as few optimizations as possible and get rid of such stuff? Which options would I have to set? Or is there a reason the compiler cannot detect these cases with -O0 or -O1 and there is no way around them?
Why is the subq $8, %rsp statement generated at all? You cannot optimize in or out a statement that makes no sense to begin with. Why does the compiler generate it then? The register allocation algorithm should never, even with O0, generate code for something that is not there. So why it is done?

is there a reason the compiler cannot detect these cases with -O0 or -O1
exactly because you're telling the compiler not to. These are optimisation levels that need to be turn off or down for proper debugging. You're also trading off compilation time for run-time.
You're looking through the telescope the wrong way, check out the awesome optimisations that you're compiler will do for you when you crank up optimisation.

I don't see any obvious missed optimizations in your -O1 output. Except of course setting up RBP as a frame pointer, but you used -fno-omit-frame-pointer so clearly you know why GCC didn't optimize that away.
The function has no local variables
Your function is a non-static class member function, so it has one implicit arg: this in rdi. Which g++ spills to the stack because of -O0. Function args count as local variables.
How does a cyclic move without an effect improve the debugging experience. Please elaborate.
To improve C/C++ debugging: debug-info formats can only describe a C variable's location relative to RSP or RBP, not which register it's currently in. Also, so you can modify any variable with a debugger and continue, getting the expected results as if you'd done that in the C++ abstract machine. Every statement is compiled to a separate block of asm with no values alive in registers (Fun fact: except register int foo: that keyword does affect debug-mode code gen).
Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to G++ and other compilers as well.
Which options would I have to set?
If you're reading / debugging the asm, use at least -Og or higher to disable the debug-mode spill-everything-between-statements behaviour of -O0. Preferably -O2 or -O3 unless you like seeing even more missed optimizations than you'd get with full optimization. But -Og or -O1 will do register allocation and make sane loops (with the conditional branch at the bottom), and various simple optimizations. Although still not the standard peephole of xor-zeroing.
How to remove "noise" from GCC/clang assembly output? explains how to write functions that take args and return a value so you can write functions that don't optimize away.
Loading into RAX and then movq %rax, %rdi is just a side-effect of -O0. GCC spends so little time optimizing the GIMPLE and/or RTL internal representations of the program logic (before emitting x86 asm) that it doesn't even notice it could have loaded into RDI in the first place. Part of the point of -O0 is to compile quickly, as well as consistent debugging.
Why is the subq $8, %rsp statement generated at all?
Because the ABI requires 16-byte stack alignment before a call instruction, and this function did an even number of 8-byte pushes. (call itself pushes a return address). It will go away at -O1 without -fno-omit-frame-pointer because you aren't forcing g++ to push/pop RBP as well as the call-preserved register it actually needs.
Why does System V / AMD64 ABI mandate a 16 byte stack alignment?
Fun fact: clang will often use a dummy push %rcx/pop or something, depending on -mtune options, instead of an 8-byte sub.
If it were a leaf function, g++ would just use the red-zone below RSP for locals, even at -O0. Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?
In un-optimized code it's not rare for G++ to allocate an extra 16 bytes it doesn't ever use. Even sometimes with optimization enabled g++ rounds up its stack allocation size too far when aiming for a 16-byte boundary. This is a missed-optimization bug. e.g. Memory allocation and addressing in Assembly

Visual C++ optimization options - how to improve the code output?

Are there any options (other than /O2) to improve the Visual C++ code output? The MSDN documentation is quite bad in this regard.
Note that I'm not asking about project-wide settings (link-time optimization, etc...). I'm only interested in this particular example.
The fairly simple C++11 code looks like this:
#include <vector>
int main() {
std::vector<int> v = {1, 2, 3, 4};
int sum = 0;
for(int i = 0; i < v.size(); i++) {
sum += v[i];
}
return sum;
}
Clang's output with libc++ is quite compact:
main: # #main
mov eax, 10
ret
Visual C++ output, on the other hand, is a multi-page mess.
Am I missing something here or is VS really this bad?
Compiler explorer link:
https://godbolt.org/g/GJYHjE

Unfortunately, it's difficult to greatly improve Visual C++ output in this case, even by using more aggressive optimization flags. There are several factors contributing to VS inefficiency, including lack of certain compiler optimizations, and the structure of Microsoft's implementation of <vector>.
Inspecting the generated assembly, Clang does an outstanding job optimizing this code. Specifically, when compared to VS, Clang is able to perform a very effective Constant propagation, Function Inlining (and consequently, Dead Code Elimination), and New/delete optimization.
Constant Propagation
In the example, the vector is statically initialized:
std::vector<int> v = {1, 2, 3, 4};
Normally, the compiler will store the constants 1, 2, 3, 4 in the data memory, and in the for loop, will load one value at one at a time, starting from the low address in which 1 is stored, and add each value to the sum.
Here's the abbreviated VS code for doing this:
movdqa xmm0, XMMWORD PTR __xmm#00000004000000030000000200000001
...
movdqu XMMWORD PTR $T1[rsp], xmm0 ; Store integers 1, 2, 3, 4 in memory
...
$LL4#main:
add ebx, DWORD PTR [rdx] ; loop and sum the values
lea rdx, QWORD PTR [rdx+4]
inc r8d
movsxd rax, r8d
cmp rax, r9
jb SHORT $LL4#main
Clang, however, is very clever to realize that the sum could be calculated in advance. My best guess is that it replaces the loading of the constants from memory to constant mov operations into registers (propagates the constants), and then combines them into the result of 10. This has the useful side effect of breaking dependencies, and since the addresses are no longer loaded from, the compiler is free to remove everything else as dead code.
Clang seems to be unique in doing this - neither VS or GCC were able to precalculate the vector accumulation result in advance.
New/Delete Optimization
Compilers conforming to C++14 are allowed to omit calls to new and delete on certain conditions, specifically when the number of allocation calls is not part of the observable behavior of the program (N3664 standard paper).
This has already generated much discussion on SO:
clang vs gcc - optimization including operator new
Is the compiler allowed to optimize out heap memory allocations?
Optimization of raw new[]/delete[] vs std::vector
Clang invoked with -std=c++14 -stdlib=libc++ indeed performs this optimization and eliminates the calls to new and delete, which do carry side effects, but supposedly do not affect the observable behaviour of the program. With -stdlib=libstdc++, Clang is stricter and keeps the calls to new and delete - although, by looking at the assembly, it's clear they are not really needed.
Now, when inspecting the main code generated by VS, we can find there two function calls (with the rest of vector construction and iteration code inlined into main):
call std::vector<int,std::allocator<int> >::_Range_construct_or_tidy<int const * __ptr64>
and
call void __cdecl operator delete(void * __ptr64)
The first is used for allocating the vector, and the second for deallocating it, and practically all other functions in the VS output are pulled in by these functions calls. This hints that Visual C++ will not optimize away calls to allocation functions (for C++14 conformance we should add the /std:c++14 flag, but the results are the same).
This blog post (May 10, 2017) from the Visual C++ team confirms that indeed, this optimization is not implemented. Searching the page for N3664 shows that "Avoiding/fusing allocations" is at status N/A, and linked comment says:
[E] Avoiding/fusing allocations is permitted but not required. For the time being, we’ve chosen not to implement this.
Combining new/delete optimization and constant propagation, it's easy to see the impact of these two optimizations in this Compiler Explorer 3-way comparison of Clang with -stdlib=libc++, Clang with -stdlib=libstdc++, and GCC.
STL Implementation
VS has its own STL implementation which is very differently structured than libc++ and stdlibc++, and that seems to have a large contribution to VS inferior code generation. While VS STL has some very useful features, such as checked iterators and iterator debugging hooks (_ITERATOR_DEBUG_LEVEL), it gives the general impression of being heavier and to perform less efficiently than stdlibc++.
For isolating the impact of the vector STL implementation, an interesting experiment is to use Clang for compilation, combined with the VS header files. Indeed, using Clang 5.0.0 with Visual Studio 2015 headers, results in the following code generation - clearly, the STL implementation has a huge impact!
main: # #main
.Lfunc_begin0:
.Lcfi0:
.seh_proc main
.seh_handler __CxxFrameHandler3, #unwind, #except
# BB#0: # %.lr.ph
pushq %rbp
.Lcfi1:
.seh_pushreg 5
pushq %rsi
.Lcfi2:
.seh_pushreg 6
pushq %rdi
.Lcfi3:
.seh_pushreg 7
pushq %rbx
.Lcfi4:
.seh_pushreg 3
subq $72, %rsp
.Lcfi5:
.seh_stackalloc 72
leaq 64(%rsp), %rbp
.Lcfi6:
.seh_setframe 5, 64
.Lcfi7:
.seh_endprologue
movq $-2, (%rbp)
movl $16, %ecx
callq "??2#YAPEAX_K#Z"
movq %rax, -24(%rbp)
leaq 16(%rax), %rcx
movq %rcx, -8(%rbp)
movups .L.ref.tmp(%rip), %xmm0
movups %xmm0, (%rax)
movq %rcx, -16(%rbp)
movl 4(%rax), %ebx
movl 8(%rax), %esi
movl 12(%rax), %edi
.Ltmp0:
leaq -24(%rbp), %rcx
callq "?_Tidy#?$vector#HV?$allocator#H#std###std##IEAAXXZ"
.Ltmp1:
# BB#1: # %"\01??1?$vector#HV?$allocator#H#std###std##QEAA#XZ.exit"
addl %ebx, %esi
leal 1(%rdi,%rsi), %eax
addq $72, %rsp
popq %rbx
popq %rdi
popq %rsi
popq %rbp
retq
.seh_handlerdata
.long ($cppxdata$main)#IMGREL
.text
Update - Visual Studio 2017
In Visual Studio 2017, <vector> has seen a major overhaul, as announced on this blog post from the Visual C++ team. Specifically, it mentions the following optimizations:
Eliminated unnecessary EH logic. For example, vector’s copy assignment operator had an unnecessary try-catch block. It just has to provide the basic guarantee, which we can achieve through proper action sequencing.
Improved performance by avoiding unnecessary rotate() calls. For example, emplace(where, val) was calling emplace_back() followed by rotate(). Now, vector calls rotate() in only one scenario (range insertion with input-only iterators, as previously described).
Improved performance with stateful allocators. For example, move construction with non-equal allocators now attempts to activate our memmove() optimization. (Previously, we used make_move_iterator(), which had the side effect of inhibiting the memmove() optimization.) Note that a further improvement is coming in VS 2017 Update 1, where move assignment will attempt to reuse the buffer in the non-POCMA non-equal case.
Curious, I went back to test this. When building the example in Visual Studio 2017, the result is still a multi page assembly listing, with many function calls, so even if code generation improved, it is difficult to notice.
However, when building with clang 5.0.0 and Visual Studio 2017 headers, we get the following assembly:
main: # #main
.Lcfi0:
.seh_proc main
# BB#0:
subq $40, %rsp
.Lcfi1:
.seh_stackalloc 40
.Lcfi2:
.seh_endprologue
movl $16, %ecx
callq "??2#YAPEAX_K#Z" ; void * __ptr64 __cdecl operator new(unsigned __int64)
movq %rax, %rcx
callq "??3#YAXPEAX#Z" ; void __cdecl operator delete(void * __ptr64)
movl $10, %eax
addq $40, %rsp
retq
.seh_handlerdata
.text
Note the movl $10, %eax instruction - that is, with VS 2017's <vector>, clang was able to collapse everything, precalculate the result of 10, and keep only the calls to new and delete.
I'd say that is pretty amazing!
Function Inlining
Function inlining is probably the single most vital optimization in this example. By collapsing the code of called functions into their call sites, the compiler is able to perform further optimizations on the merged code, plus, removing of function calls is beneficial in reducing call overhead and removing of optimization barriers.
When inspecting the generated assembly for VS, and comparing the code before and after inlining (Compiler Explorer), we can see that most vector functions were indeed inlined, except for the allocation and deallocation functions. In particular, there are calls to memmove, which are the result of inlining of some higher level functions, such as _Uninitialized_copy_al_unchecked.
memmove is a library function, and therefore cannot be inlined. However, clang has a clever way around this - it replaces the call to memmove with a call to __builtin_memmove. __builtin_memmove is a builtin/intrinsic function, which has the same functionality as memmove, but as opposed to the plain function call, the compiler generates code for it and embeds it into the calling function. Consequently, the code could be further optimized inside the calling function and eventually removed as dead code.
Summary
To conclude, Clang is clearly superior than VS in this example, both thanks to high quality optimizations, and more efficient vector STL implementation. When using the same header files for Visual C++ and clang (the Visual Studio 2017 headers), Clang beats Visual C++ hands down.
While writing this answer, I couldn't help not to think, what would we do without Compiler Explorer? Thanks Matt Godbolt for this amazing tool!

Why does this difference in asm matter for performance (in an un-optimized ptr++ vs. ++ptr loop)?

TL;DR: the first loop runs ~18% faster on a Haswell CPU. Why? The loops are from gcc -O0 (un-optimized) loops using ptr++ vs ++ptr, but the question is why the resulting asm performs differently, not anything about how to write better C.
Let's say we have those two loops:
movl $0, -48(%ebp) //Loop counter set to 0
movl $_data, -12(%ebp) //Pointer to the data array
movl %eax, -96(%ebp)
movl %edx, -92(%ebp)
jmp L21
L22:
// ptr++
movl -12(%ebp), %eax //Get the current address
leal 4(%eax), %edx //Calculate the next address
movl %edx, -12(%ebp) //Store the new (next) address
// rest of the loop is the same as the other
movl -48(%ebp), %edx //Get the loop counter to edx
movl %edx, (%eax) //Move the loop counter value to the CURRENT address, note -12(%ebp) contains already the next one
addl $1, -48(%ebp) //Increase the counter
L21:
cmpl $999999, -48(%ebp)
jle L22
and the second one:
movl %eax, -104(%ebp)
movl %edx, -100(%ebp)
movl $_data-4, -12(%ebp) //Address of the data - 1 element (4 byte)
movl $0, -48(%ebp) //Set the loop counter to 0
jmp L23
L24:
// ++ptr
addl $4, -12(%ebp) //Calculate the CURRENT address by adding one sizeof(int)==4 bytes
movl -12(%ebp), %eax //Store in eax the address
// rest of the loop is the same as the other
movl -48(%ebp), %edx //Store in edx the current loop counter
movl %edx, (%eax) //Move the loop counter value to the current stored address location
addl $1, -48(%ebp) //Increase the loop counter
L23:
cmpl $999999, -48(%ebp)
jle L24
Those loops are doing exactly the same thing but in a little different manner, please refer to the comment for the details.
This asm code is generated from the following two C++ loops:
//FIRST LOOP:
for(;index<size;index++){
*(ptr++) = index;
}
//SECOND LOOP:
ptr = data - 1;
for(index = 0;index<size;index++){
*(++ptr) = index;
}
Now, the first loop is about ~18% faster than the second one, no matter in which order the loops are performed the one with ptr++ is faster than the one with ++ptr.
To run my benchmarks I just collected the running time of those loops for different size, and executing them both nested in other loops to repeat the operation frequently.
ASM analysis
Looking at the ASM code, the second loop contains less instructions, we have 3 movl and 2 addl whereas in the first loop we have 4 movl one addl and one leal, so we have one movl more and one leal instead of addl
Is that correct that the LEA operation for computing the correct address is much faster than the ADD (+4) method? Is this the reason for the difference in performance?
As far as i know, once a new address is calculated before the memory can be referenced some clock cycles must elapse, so the second loop after the addl $4,-12(%ebp) need to wait a little before proceeding, whereas in the first loop we can immediately refer the memory and in the meanwhile LEAL will compute the next address (some kind of better pipeline performance here).
Is there some reordering going on here? I'm not sure about my explanation for the performance difference of those loops, can I have your opinion?

First of all, performance analysis on -O0 compiler output is usually not very interesting or useful.
Is that correct that the LEAL operation for computing the correct address is much faster than the ADDL (+4) method? Is this the reason for the difference in performance?
Nope, add can run on every ALU execution port on any x86 CPU. lea is usually as low latency with simple addressing modes, but not as good throughput. On Atom, it runs in a different stage of the pipeline from normal ALU instructions, because it actually lives up to its name and uses the AGU on that in-order microarchitecture.
See the x86 tag wiki to learn what makes code slow or fast on different microarchitectures, esp. Agner Fog's microarchitecture pdf and instruction tables.
add is only worse because it lets gcc -O0 make even worse code by using it with a memory destination and then loading from that.
Compiling with -O0 doesn't even try to use the best instructions for the job. e.g. you'll get mov $0, %eax instead of the xor %eax,%eax you always get in optimized code. You shouldn't infer anything about what's good from looking at un-optimized compiler output.
-O0 code is always full of bottlenecks, usually on load/store or store-forwarding. Unfortunately IACA doesn't account for store-forwarding latency, so it doesn't realize that these loops actually bottleneck on
As far as i know, once a new address is calculated before the memory can be referenced some clock cycles must elapse, so the second loop after the addl $4,-12(%ebp) need to wait a little before proceeding,
Yes, the mov load of -12(%ebp) won't be ready for about 6 cycles after the load that was part of add's read-modify-write.
whereas in the first loop we can immediately refer the memory
Yes
and in the meanwhile LEAL will compute the next address
No.
Your analysis is close, but you missed the fact that the next iteration still has to load the value we stored into -12(%ebp). So the loop-carried dependency chain is the same length, and next iteration's lea can't actually start any sooner than in the loop using add
The latency issues might not be the loop throughput bottleneck:
uop / execution port throughput needs to be considered. In this case, the OP's testing shows it's actually relevant. (Or latency from resource conflicts.)
When gcc -O0 implements ptr++, it keeps the old value in a register, like you said. So store addresses are known further ahead of time, and there's one fewer load uop that needs an AGU.
Assuming an Intel SnB-family CPU:
## ptr++: 1st loop
movl -12(%ebp), %eax //1 uop (load)
leal 4(%eax), %edx //1 uop (ALU only)
movl %edx, -12(%ebp) //1 store-address, 1 store-data
// no load from -12(%ebp) into %eax
... rest the same.
## ++ptr: 2nd loop
addl $4, -12(%ebp) // read-modify-write: 2 fused-domain uops. 4 unfused: 1 load + 1 store-address + 1 store-data
movl -12(%ebp), %eax // load: 1 uop. ~6 cycle latency for %eax to be ready
... rest the same
So the pointer-increment part of the 2nd loop has one more load uop. Probably the code bottlenecks on AGU throughput (address-generation units). IACA says that's the case for arch=SNB, but that HSW bottlenecks on store-data throughput (not AGUs).
However, without taking store-forwarding latency into account, IACA says the first loop can run at one iteration per 3.5 cycles, vs. one per 4 cycles for the second loop. That's faster than the 6 cycle loop-carried dependency of the addl $1, -48(%ebp) loop counter, which indicates that the loop is bottlenecked by latency to less than max AGU throughput. (Resource conflicts probably mean it actually runs slower than one iteration per 6c, see below).
We could test this theory:
Adding an extra load uop to the lea version, off the critical path, would take more throughput, but wouldn't be part of the loop's latency chains. e.g.
movl -12(%ebp), %eax //Get the current address
leal 4(%eax), %edx //Calculate the next address
movl %edx, -12(%ebp) //Store the new (next) address
mov -12(%ebp), %edx
%edx is about to be overwritten by a mov, so there are no dependencies on the result of this load. (The destination of mov is write-only, so it breaks dependency chains, thanks to register renaming.).
So this extra load would bring the lea loop up to the same number and flavour of uops as the add loop, but with different latency. If the extra load has no effect on speed, we know the first loop isn't bottlenecked on load / store throughput.
Update: OP's testing confirmed that an extra unused load slows the lea loop down to about the same speed as the add loop.
Why extra uops matter when we're not hitting execution port throughput bottlenecks
uops are scheduled in oldest-first order (out of uops that have their operands ready), not in critical-path-first order. Extra uops that could have been done in a spare cycle later on will actually delay uops that are on the critical path (e.g. part of the loop-carried dependency). This is called a resource conflict, and can increase the latency of the critical path.
i.e. instead of waiting for a cycle where critical-path latency left a load port with nothing to do, the unused load will run when it's the oldest load with its load-address ready. This will delay other loads.
Similarly, in the add loop where the extra load is part of the critical path, the extra load causes more resource conflicts, delaying operations on the critical path.
Other guesses:
So maybe having the store address ready sooner is what's doing it, so memory operations are pipelined better. (e.g. TLB-miss page walks can start sooner when approaching a page boundary. Even normal hardware prefetching doesn't cross page boundaries, even if they are hot in the TLB. The loop touches 4MiB of memory, which is enough for this kind of thing to matter. L3 latency is high enough to maybe create a pipeline bubble. Or if your L3 is small, then main memory certainly is.
Or maybe the extra latency just makes it harder for out-of-order execution to do a good job.

How does a file read looks like in gdb

When a program reads from a file, how does this looks like in GDB. I know that the file has t o be opened etc.
Also something like this:
call fopen
...
call fread
...
call fclose
but can you explain me the "copy bytes from file to program memory" action? perhaps with an example of assembly.
In which registers are the adresses of the file/of the programm memory where the file content is stored.
fopen returns a handle. How can i track that?

You can easily check this by yourself by simply writing a small C program:
#include <stdio.h>
int main() {
char c;
FILE *f = fopen("hello.txt", "r");
c=fgetc(f);
fclose(f);
return 0;
}
Then compile it with a switch to output assembly, for gcc: gcc -S -O0 hello.c (the -S tells gcc to output assembly, the -O0 disables the optimizer).
Then have a look at hello.s and you will see the assembly generated for this code:
...
subq $16, %rsp
movl $.LC0, %esi
movl $.LC1, %edi
call fopen
movq %rax, -8(%rbp)
movq -8(%rbp), %rax
movq %rax, %rdi
call fgetc
movb %al, -9(%rbp) <--- here a byte gets moved from a register to variable c
movq -8(%rbp), %rax
movq %rax, %rdi
call fclose
...
Essentially the function fgets or fgetc is called and the result is copied back to your variables. You can easily check the generated code on your platform with your compiler and the input method you are interested in.
Note that there may be more possible assembly variants if other libraries are used for input.
If the optimizer was used when the program was compiled, the assembly might even look more strange.

Well, it seems like you are looking to learn assembly. Fopen return a handle to the file. When you will call Fopen, and basically many other functions the return value will be placed in the EAX register, but it depends of hardware also. So if you want to track a handle of a file, after calling fopen in ASM you should get it from EAX register. If you are intersted in more details you should look up for some tutorials or books of assembly. For example Assembly Programming tutorial

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js