Optimizations in copying a range - c++

While reading sources of GNU C++ standard library, I found some code for copying (or moving, if possible) a range of iterators (file stl_algobase.h), which uses template specialization for some optimizations. A comment corresponding to it says:
All of these auxiliary structs serve two purposes. (1) Replace calls to copy with memmove whenever possible. (Memmove, not memcpy, because the input and output ranges are permitted to overlap.) (2) If we're using random access iterators, then write the loop as a for loop with an explicit count.
The specialization using the second optimization looks like this:
template<>
struct __copy_move<false, false, random_access_iterator_tag>
{
template<typename _II, typename _OI>
static _OI
__copy_m(_II __first, _II __last, _OI __result)
{
typedef typename iterator_traits<_II>::difference_type _Distance;
for(_Distance __n = __last - __first; __n > 0; --__n)
{
*__result = *__first;
++__first;
++__result;
}
return __result;
}
};
So, I have two questions concerning this
How can memmove increase the spead of copying? Is it implemented somehow more effective than a simple loop?
How can using explicit counter in the for loop affect the performance?
Some clarification: I would like to see some optimization examples actually used by compilers, not elaboration on the possibility of those.
Edit: the first question is quite nicely answered here.

Answering the second question, the explicit count does indeed lead to more opportunities for loop unrolling, though even with pointers iterating through a fixed size array, gcc does not perform aggressive unrolling unless asked to do so with -funroll-loops. The other gain comes from a potentially simpler end-of-loop comparison test for non-trivial iterators.
On a Core i7-4770, I benchmarked the time spent performing a copy of a maximally-aligned 2048-long integer array with a while loop and explicit count copy implementation. (Times in microseconds, includes call overhead; minimum of 200 samples of a timing loop with warm-up.)
while count
gcc -O3 0.179 0.178
gcc -O3 -march=native 0.097 0.095
gcc -O3 -march=native -funroll-loops 0.066 0.066
In each case, the generated code is very similar; the while version does a bit more work at the end in each case, handling checks that there aren't any entries left to copy that didn't fill out a whole 128-bit (SSE) or 256-bit (AVX) register, but these are pretty much taken care of by the branch predictor. The gcc -O3 assembly for each is as follows (leaving out assembler directives). while version:
array_copy_while(int (&) [2048], int (&) [2048]):
leaq 8192(%rdi), %rax
leaq 4(%rdi), %rdx
movq %rax, %rcx
subq %rdx, %rcx
movq %rcx, %rdx
shrq $2, %rdx
leaq 1(%rdx), %r8
cmpq $8, %r8
jbe .L11
leaq 16(%rsi), %rdx
cmpq %rdx, %rdi
leaq 16(%rdi), %rdx
setae %cl
cmpq %rdx, %rsi
setae %dl
orb %dl, %cl
je .L11
movq %r8, %r9
xorl %edx, %edx
xorl %ecx, %ecx
shrq $2, %r9
leaq 0(,%r9,4), %r10
.L9:
movdqa (%rdi,%rdx), %xmm0
addq $1, %rcx
movdqa %xmm0, (%rsi,%rdx)
addq $16, %rdx
cmpq %rcx, %r9
ja .L9
leaq 0(,%r10,4), %rdx
addq %rdx, %rdi
addq %rdx, %rsi
cmpq %r10, %r8
je .L1
movl (%rdi), %edx
movl %edx, (%rsi)
leaq 4(%rdi), %rdx
cmpq %rdx, %rax
je .L1
movl 4(%rdi), %edx
movl %edx, 4(%rsi)
leaq 8(%rdi), %rdx
cmpq %rdx, %rax
je .L20
movl 8(%rdi), %eax
movl %eax, 8(%rsi)
ret
.L11:
movl (%rdi), %edx
addq $4, %rdi
addq $4, %rsi
movl %edx, -4(%rsi)
cmpq %rdi, %rax
jne .L11
.L1:
rep ret
.L20:
rep ret
count version:
array_copy_count(int (&) [2048], int (&) [2048]):
leaq 16(%rsi), %rax
movl $2048, %ecx
cmpq %rax, %rdi
leaq 16(%rdi), %rax
setae %dl
cmpq %rax, %rsi
setae %al
orb %al, %dl
je .L23
movw $512, %cx
xorl %eax, %eax
xorl %edx, %edx
.L29:
movdqa (%rdi,%rax), %xmm0
addq $1, %rdx
movdqa %xmm0, (%rsi,%rax)
addq $16, %rax
cmpq %rdx, %rcx
ja .L29
rep ret
.L23:
xorl %eax, %eax
.L31:
movl (%rdi,%rax,4), %edx
movl %edx, (%rsi,%rax,4)
addq $1, %rax
cmpq %rax, %rcx
jne .L31
rep ret
When the iterators are more complicated however, the difference becomes more pronounced. Consider a hypothetical container that stores values in a series of fixed-sized allocated buffers. An iterator comprises a pointer to the chain of blocks, a block index and a block offset. Comparison of two iterators requires potentially two comparisons. Incrementing the iterator requires checking if we pop over a block boundary.
I made such a container, and performed the same benchmark for copying a 2000-long container of int, with a block size of 512 ints.
while count
gcc -O3 1.560 2.818
gcc -O3 -march=native 1.660 2.854
gcc -O3 -march=native -funroll-loops 1.432 2.858
That looks weird! Oh wait, it's because gcc 4.8 has a misoptimisation, where it uses conditional moves instead of nice, branch-predictor friendly comparisons. (gcc bug 56309).
Let's try icc on a different machine (Xeon E5-2670).
while count
icc -O3 3.952 3.704
icc -O3 -xHost 3.898 3.624
This is closer to what we'd expect, a small but significant improvement from the simpler loop condition. On a different architecture, the gain is more pronounced. clang targeting a PowerA2 at 1.6GHz:
while count
bgclang -O3 36.528 31.623
I'll omit the assembly, as it's quite long!

Related

Expression template code not optimized fully

I have the following linear algebra function call (vector-vector addition) in C++.
int m = 4;
blasfeo_dvec one, two, three;
blasfeo_allocate_dvec(m, &one);
blasfeo_allocate_dvec(m, &two);
blasfeo_allocate_dvec(m, &three);
// initialize vectors ... (omitted)
blasfeo_daxpy(m, 1.0, &one, 0, &two, 0, &three, 0);
Using expression templates (ETs), we can wrap it as follows:
three = one + two;
where the vector struct looks like
struct blasfeo_dvec {
int m; // length
int pm; // packed length
double *pa; // pointer to a pm array of doubles, the first is aligned to cache line size
int memsize; // size of needed memory
void operator=(const vec_expression_sum<blasfeo_dvec, blasfeo_dvec> expr) {
blasfeo_daxpy(m, 1.0, (blasfeo_dvec *) &expr.vec_a, 0, (blasfeo_dvec *) &expr.vec_b, 0, this, 0);
}
};
The cast to non-const is necessary because blasfeo_daxpy takes non-const pointers. The ET code is simply
template<typename Ta, typename Tb>
struct vec_expression_sum {
const Ta vec_a;
const Tb vec_b;
vec_expression_sum(const Ta va, const Tb vb) : vec_a {va}, vec_b {vb} {}
};
template<typename Ta, typename Tb>
auto operator+(const Ta a, const Tb b) {
return vec_expression_sum<Ta, Tb>(a, b);
}
The 'native' call, i.e. blasfeo_daxpy(...) generates the following assembly:
; allocation and initialization omitted ...
movl $0, (%rsp)
movl $4, %edi
xorl %edx, %edx
xorl %r8d, %r8d
movsd LCPI0_0(%rip), %xmm0 ## xmm0 = mem[0],zero
movq %r14, %rsi
movq %rbx, %rcx
movq %r15, %r9
callq _blasfeo_daxpy
...
which is exactly what you would expect. The ET code is quite a bit longer:
; allocation :
leaq -120(%rbp), %rbx
movl $4, %edi
movq %rbx, %rsi
callq _blasfeo_allocate_dvec
leaq -96(%rbp), %r15
movl $4, %edi
movq %r15, %rsi
callq _blasfeo_allocate_dvec
leaq -192(%rbp), %r14
movl $4, %edi
movq %r14, %rsi
callq _blasfeo_allocate_dvec
; initialization code omitted
; operator+ :
movq -104(%rbp), %rax
movq %rax, -56(%rbp)
movq -120(%rbp), %rax
movq -112(%rbp), %rcx
movq %rcx, -64(%rbp)
movq %rax, -72(%rbp)
; vec_expression_sum :
movq -80(%rbp), %rax
movq %rax, -32(%rbp)
movq -96(%rbp), %rax
movq -88(%rbp), %rcx
movq %rcx, -40(%rbp)
movq %rax, -48(%rbp)
movq -32(%rbp), %rax
movq %rax, -128(%rbp)
movq -40(%rbp), %rax
movq %rax, -136(%rbp)
movq -48(%rbp), %rax
movq %rax, -144(%rbp)
movq -56(%rbp), %rax
movq %rax, -152(%rbp)
movq -72(%rbp), %rax
movq -64(%rbp), %rcx
movq %rcx, -160(%rbp)
movq %rax, -168(%rbp)
leaq -144(%rbp), %rcx
; blasfeo_daxpy :
movl -192(%rbp), %edi
movl $0, (%rsp)
leaq -168(%rbp), %rsi
xorl %edx, %edx
xorl %r8d, %r8d
movsd LCPI0_0(%rip), %xmm0 ## xmm0 = mem[0],zero
movq %r14, %r9
callq _blasfeo_daxpy
...
It involves quite a bit of copying, namely the fields of blasfeo_dvec. I (naively, maybe) hoped that the ET code would generate the exact same code as the native call, given that everything is fixed at compile time and const, but it doesn't.
The question is: why the extra loads? And is there a way of getting fully 'optimized' code? (edit: I use Apple LLVM version 8.1.0 (clang-802.0.42) with -std=c++14 -O3)
Note: I read and understood this and this post on a similar topic, but they unfortunately do not contain an answer to my question.

Why use leal instead of incq?

I was fooling around and found that the following
#include <stdio.h>
void f(int& x){
x+=1;
}
int main(){
int a = 12;
f(a);
printf("%d\n",a);
}
when translated by g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4 with g++ main.cpp -S produces this assembly (showing only the relevant parts)
_Z1fRi:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
movq -8(%rbp), %rax
movl (%rax), %eax
leal 1(%rax), %edx
movq -8(%rbp), %rax
movl %edx, (%rax)
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $12, -4(%rbp)
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Z1fRi
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
Question: Why would the compiler choose to use leal instead of incq? Or am I missing something?
You compiled without optimization. GCC does not make any effort to select particularly well-fitting instructions when building in "debug" mode; it just focuses on generating the code as quickly as possible (and with an eye to making debugging easier—e.g., the ability to set breakpoints on source code lines).
When I enable optimizations by passing the -O2 switch, I get:
_Z1fRi:
addl $1, (%rdi)
ret
With generic tuning, the addl is preferred because some Intel processors (specifically Pentium 4, but also possibly Knight's Landing) have a false flags dependency.
With -march=k8, incl is used instead.
There is sometimes a use-case for leal in optimized code, though, and that is when you want to increment a register's value and store the result in a different register. Using leal in this way would allow you to preserve the register's original value, without needing an additional movl instruction. Another advantage of leal over incl/addl is that leal doesn't affect the flags, which can be useful in instruction scheduling.

C++ GCC Optimization Speed slows down when local variable is copied to global variable

I have a question regarding GCC's optimization flags and how they work.
I have a very long piece of code that utilizes all local arrays and variables. At the end of the code, I copy the contents of the local array to a global array. Here is an extremely stripped down example of my code:
uint8_t globalArray[16]={0};
void func()
{
unsigned char localArray[16]={0};
for (int r=0; r<1000000; r++)
{
**manipulate localArray with a lot of calculations**
}
memcpy(&globalArray,localArray,16);
}
Here's the approximate speed of the code in three different scenarios:
Without "-O3" optimization: 3.203s
With "-O3" optimization: 1.457s
With "-O3" optimization and without the final memcpy(&globalArray,localArray,16); statement: 0.015s
Without copying the local array into the global array, the code runs almost 100 times faster. I know that the global array is stored in the memory and the local array is stored in registers. My question is:
Why does just copying 16 elements of a local array to a global array cause 100 times slower execution? I have searched this forum and online and I cannot find a definite answer to this particular scenario of mine.
Is there any way that I can extract the contents of the local variable without the speed loss?
Thank you in advance to anyone that can help me with this problem.
Without the memcpy, your compiler will likely see that localArray is never read from, so it doesn't need to do any of the calculations in the loop body.
Take this code as an example:
uint8_t globalArray[16]={0};
void func()
{
unsigned char localArray[16]={0};
for (int r=0; r<1000000; r++)
{
localArray[r%16] = r;
}
memcpy(&globalArray,localArray,16);
}
Clang 3.7.1 with -O3 outputs this assembly:
func(): # #func()
# BB#0:
xorps %xmm0, %xmm0
movaps %xmm0, -24(%rsp)
#DEBUG_VALUE: r <- 0
xorl %eax, %eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
#DEBUG_VALUE: r <- 0
movl %eax, %ecx
sarl $31, %ecx
shrl $28, %ecx
leal (%rcx,%rax), %ecx
andl $-16, %ecx
movl %eax, %edx
subl %ecx, %edx
movslq %edx, %rcx
movb %al, -24(%rsp,%rcx)
leal 1(%rax), %ecx
#DEBUG_VALUE: r <- ECX
movl %ecx, %edx
sarl $31, %edx
shrl $28, %edx
leal 1(%rax,%rdx), %edx
andl $-16, %edx
negl %edx
leal 1(%rax,%rdx), %edx
movslq %edx, %rdx
movb %cl, -24(%rsp,%rdx)
leal 2(%rax), %ecx
movl %ecx, %edx
sarl $31, %edx
shrl $28, %edx
leal 2(%rax,%rdx), %edx
andl $-16, %edx
negl %edx
leal 2(%rax,%rdx), %edx
movslq %edx, %rdx
movb %cl, -24(%rsp,%rdx)
leal 3(%rax), %ecx
movl %ecx, %edx
sarl $31, %edx
shrl $28, %edx
leal 3(%rax,%rdx), %edx
andl $-16, %edx
negl %edx
leal 3(%rax,%rdx), %edx
movslq %edx, %rdx
movb %cl, -24(%rsp,%rdx)
leal 4(%rax), %ecx
movl %ecx, %edx
sarl $31, %edx
shrl $28, %edx
leal 4(%rax,%rdx), %edx
andl $-16, %edx
negl %edx
leal 4(%rax,%rdx), %edx
movslq %edx, %rdx
movb %cl, -24(%rsp,%rdx)
addl $5, %eax
cmpl $1000000, %eax # imm = 0xF4240
jne .LBB0_1
# BB#2:
movaps -24(%rsp), %xmm0
movaps %xmm0, globalArray(%rip)
retq
For the same code without the memcpy, it outputs this:
func(): # #func()
# BB#0:
#DEBUG_VALUE: r <- 0
retq
Even if you know nothing about assembly, it's clear to see that the latter just does nothing.

Can tail recursion apply to this code?

I have a simple piece of code, that addresses this (poorly stated, out of place) question :
template<typename It>
bool isAlpha(It first, It last)
{
return (first != last && *first != '\0') ?
(isalpha(static_cast<int>(*first)) && isAlpha(++first, last)) : true;
}
I'm trying to figure out how can I go about implementing it in a tail recursive fashion, and although there are great sources like this answer, I can't wrap my mind around it.
Can anyone help ?
EDIT
I'm placing the disassembly code below; The compiler is gcc 4.9.0 compiling with -std=c++11 -O2 -Wall -pedantic the assembly output is
bool isAlpha<char const*>(char const*, char const*):
cmpq %rdi, %rsi
je .L5
movzbl (%rdi), %edx
movl $1, %eax
testb %dl, %dl
je .L12
pushq %rbp
pushq %rbx
leaq 1(%rdi), %rbx
movq %rsi, %rbp
subq $8, %rsp
.L3:
movsbl %dl, %edi
call isalpha
testl %eax, %eax
jne .L14
xorl %eax, %eax
.L2:
addq $8, %rsp
popq %rbx
popq %rbp
.L12:
rep ret
.L14:
cmpq %rbp, %rbx
je .L7
addq $1, %rbx
movzbl -1(%rbx), %edx
testb %dl, %dl
jne .L3
.L7:
movl $1, %eax
jmp .L2
.L5:
movl $1, %eax
ret
To clarify cdhowie's point, the function can be rewritten as follows (unless I made a mistake):
bool isAlpha(It first, It last)
{
if (first == last)
return true;
if (*first == '\0')
return true;
if (!isalpha(static_cast<int>(*first))
return false;
return isAlpha(++first, last);
}
Which would indeed allow for trivial tail call elimination.
This is normally a job for the compiler, though.

Can modern compilers unroll `for` loops expressed using begin and end iterators

Consider the following code
vector<double> v;
// fill v
const vector<double>::iterator end =v.end();
for(vector<double>::iterator i = v.bgin(); i != end; ++i) {
// do stuff
}
Are compilers like g++, clang++, icc able to unroll loops like this. Unfortunately I do not know assembly to be able verify from the output whether the loop gets unrolled or not. (and I only have access to g++.)
To me it seems that this will require more smartness than usual on behalf of the compiler, first to deduce that the iterator is a random access iterator, and then figure out the number of times the loop is executed. Can compilers do this when optimization is enabled ?
Thanks for your replies, and before some of you start lecturing about premature optimization, this is an excercise in curiosity.
To me it seems that this will require more smartness than usual on behalf of the compiler, first to deduce that the iterator is a random access iterator, and then figure out the number of times the loop is executed.
The STL, being comprised entirely of templates, has all the code inline. So, random access iterators reduce to pointers already when the compiler begins to apply optimizations. One of the reasons the STL was created was so that there would be less need for a programmer to outwit the compiler. You should rely on the STL to do the right thing until proven otherwise.
Of course, it is still up to you to choose the proper tool from the STL to use...
Edit: There was discussion about whether g++ does any loop unrolling. On the versions that I am using, loop unrolling is not part of -O, -O2, or -O3, and I get identical assembly for the latter two levels with the following code:
void foo (std::vector<int> &v) {
volatile int c = 0;
const std::vector<int>::const_iterator end = v.end();
for (std::vector<int>::iterator i = v.begin(); i != end; ++i) {
*i = c++;
}
}
With the corresponding assembly -O2 assembly:
_Z3fooRSt6vectorIiSaIiEE:
.LFB435:
movq 8(%rdi), %rcx
movq (%rdi), %rax
movl $0, -4(%rsp)
cmpq %rax, %rcx
je .L4
.p2align 4,,10
.p2align 3
.L3:
movl -4(%rsp), %edx
movl %edx, (%rax)
addq $4, %rax
addl $1, %edx
cmpq %rax, %rcx
movl %edx, -4(%rsp)
jne .L3
.L4:
rep
ret
With the -funroll-loops option added, the function expands into something much much larger. But, the documentation warns about this option:
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. It also turns on complete loop peeling (i.e. complete removal of loops with small constant number of iterations). This option makes code larger, and may or may not make it run faster.
As a further argument to dissuade you from unrolling loops yourself, I'll finish this answer with an illustration of applying Duff's Device to the foo function above:
void foo_duff (std::vector<int> &v) {
volatile int c = 0;
const std::vector<int>::const_iterator end = v.end();
std::vector<int>::iterator i = v.begin();
switch ((end - i) % 4) do {
case 0: *i++ = c++;
case 3: *i++ = c++;
case 2: *i++ = c++;
case 1: *i++ = c++;
} while (i != end);
}
GCC has another loop optimization flag:
-ftree-loop-optimize
Perform loop optimizations on trees. This flag is enabled by default at -O and higher.
So, the -O option enables simple loop optimizations for the innermost loops, including complete loop unrolling (peeling) for loops with a fixed number of iterations. (Thanks to doc for pointing this out to me.)
I would propose that whether or not the compiler CAN unroll the loop, with modern pipelined architectures and caches, unless your "do stuff" is trivial, there is little benefit in doing so, and in many cases doing so would be a performance HIT instead of a boon. If your "do stuff" is nontrivial, unrolling the loop will create multiple copies of this nontrivial code, which will take extra time to load into the cache, significantly slowing down the first iteration through the unrolled loop. At the same time, it will evict more code from the cache, which may have been necessary for performing the "do stuff" if it makes any function calls, which would then need to be reloaded into the cache again. The purpose for unrolling loops made a lot of sense before cacheless pipelined non-branch-predictive architectures, with the goal being to reduce the overhead associated with the loop logic. Nowadays with cache-based pipelined branch-predictive hardware, your cpu will be pipelined well into the next loop iteration, speculatively executing the loop code again, by the time you detect the i==end exit condition, at which point the processor will throw out that final speculatively-executed set of results. In such an architecture, loop unrolling makes very little sense. It would further bloat code for virtually no benefit.
The short answer is yes. It will unroll as much as it can. In your case, it depends how you define end obviously (I assume your example is generic). Not only will most modern compilers unroll, but they will also vectorize and do other optimizations that will often blow your own solutions out of the water.
So what I'm saying is don't prematurely optimize! Just kidding :)
Simple answer: generally NO! At least when it comes to complete loop unrolling.
Let's test loop unrolling on this simple, dirty-coded (for testing purposes) structure.
struct Test
{
Test(): begin(arr), end(arr + 4) {}
double * begin;
double * end;
double arr[4];
};
First let's take counted loop and compile it without any optimizations.
double counted(double param, Test & d)
{
for (int i = 0; i < 4; i++)
param += d.arr[i];
return param;
}
Here's what gcc 4.9 produces.
counted(double, Test&):
pushq %rbp
movq %rsp, %rbp
movsd %xmm0, -24(%rbp)
movq %rdi, -32(%rbp)
movl $0, -4(%rbp)
jmp .L2
.L3:
movq -32(%rbp), %rax
movl -4(%rbp), %edx
movslq %edx, %rdx
addq $2, %rdx
movsd (%rax,%rdx,8), %xmm0
movsd -24(%rbp), %xmm1
addsd %xmm0, %xmm1
movq %xmm1, %rax
movq %rax, -24(%rbp)
addl $1, -4(%rbp)
.L2:
cmpl $3, -4(%rbp)
jle .L3
movq -24(%rbp), %rax
movq %rax, -40(%rbp)
movsd -40(%rbp), %xmm0
popq %rbp
ret
As expected loop hasn't been unrolled and, since no optimizations were performed, code is generally very verbose. Now let's turn on -O3 flag. Produced disassembly:
counted(double, Test&):
addsd 16(%rdi), %xmm0
addsd 24(%rdi), %xmm0
addsd 32(%rdi), %xmm0
addsd 40(%rdi), %xmm0
ret
Voila, loop has been unrolled this time.
Now let's take a look at iterated loop. Function containing the loop will look like this.
double iterated(double param, Test & d)
{
for (double * it = d.begin; it != d.end; ++it)
param += *it;
return param;
}
Still using -O3 flag, let's take a look at disassembly.
iterated(double, Test&):
movq (%rdi), %rax
movq 8(%rdi), %rdx
cmpq %rdx, %rax
je .L3
.L4:
addsd (%rax), %xmm0
addq $8, %rax
cmpq %rdx, %rax
jne .L4
.L3:
rep ret
Code looks better than in the very first case, because optimizations were performed, but loop hasn't been unrolled this time!
What about funroll-loops and funroll-all-loops flags? They will produce result similar to this
iterated(double, Test&):
movq (%rdi), %rsi
movq 8(%rdi), %rcx
cmpq %rcx, %rsi
je .L3
movq %rcx, %rdx
leaq 8(%rsi), %rax
addsd (%rsi), %xmm0
subq %rsi, %rdx
subq $8, %rdx
shrq $3, %rdx
andl $7, %edx
cmpq %rcx, %rax
je .L43
testq %rdx, %rdx
je .L4
cmpq $1, %rdx
je .L29
cmpq $2, %rdx
je .L30
cmpq $3, %rdx
je .L31
cmpq $4, %rdx
je .L32
cmpq $5, %rdx
je .L33
cmpq $6, %rdx
je .L34
addsd (%rax), %xmm0
leaq 16(%rsi), %rax
.L34:
addsd (%rax), %xmm0
addq $8, %rax
.L33:
addsd (%rax), %xmm0
addq $8, %rax
.L32:
addsd (%rax), %xmm0
addq $8, %rax
.L31:
addsd (%rax), %xmm0
addq $8, %rax
.L30:
addsd (%rax), %xmm0
addq $8, %rax
.L29:
addsd (%rax), %xmm0
addq $8, %rax
cmpq %rcx, %rax
je .L44
.L4:
addsd (%rax), %xmm0
addq $64, %rax
addsd -56(%rax), %xmm0
addsd -48(%rax), %xmm0
addsd -40(%rax), %xmm0
addsd -32(%rax), %xmm0
addsd -24(%rax), %xmm0
addsd -16(%rax), %xmm0
addsd -8(%rax), %xmm0
cmpq %rcx, %rax
jne .L4
.L3:
rep ret
.L44:
rep ret
.L43:
rep ret
Compare results with unrolled loop for counted loop. It's clearly not the same. What we see here is that gcc divided the loop into 8 element chunks. This can increase performance in some cases, because loop exit condition is checked once per 8 normal loop iterations. With additional flags vectorization could be also performed. But it isn't complete loop unrolling.
Iterated loop will be unrolled however if Test object is not a function argument.
double iteratedLocal(double param)
{
Test d;
for (double * it = d.begin; it != d.end; ++it)
param += *it;
return param;
}
Disassembly produced with only -O3 flag:
iteratedLocal(double):
addsd -40(%rsp), %xmm0
addsd -32(%rsp), %xmm0
addsd -24(%rsp), %xmm0
addsd -16(%rsp), %xmm0
ret
As you can see loop has been unrolled. This is because compiler can now safely assume that end has fixed value, while it couldn't predict that for function argument.
Test structure is statically allocated however. Things are more complicated with dynamically allocated structures like std::vector. From my observations on modified Test structure, so that it ressembles dynamically allocated container, it looks like gcc tries its best to unroll loops, but in most cases generated code is not as simple as one above.
As you ask for other compilers, here's output from clang 3.4.1 (-O3 flag)
counted(double, Test&): # #counted(double, Test&)
addsd 16(%rdi), %xmm0
addsd 24(%rdi), %xmm0
addsd 32(%rdi), %xmm0
addsd 40(%rdi), %xmm0
ret
iterated(double, Test&): # #iterated(double, Test&)
movq (%rdi), %rax
movq 8(%rdi), %rcx
cmpq %rcx, %rax
je .LBB1_2
.LBB1_1: # %.lr.ph
addsd (%rax), %xmm0
addq $8, %rax
cmpq %rax, %rcx
jne .LBB1_1
.LBB1_2: # %._crit_edge
ret
iteratedLocal(double): # #iteratedLocal(double)
leaq -32(%rsp), %rax
movq %rax, -48(%rsp)
leaq (%rsp), %rax
movq %rax, -40(%rsp)
xorl %eax, %eax
jmp .LBB2_1
.LBB2_2: # %._crit_edge4
movsd -24(%rsp,%rax), %xmm1
addq $8, %rax
.LBB2_1: # =>This Inner Loop Header: Depth=1
movaps %xmm0, %xmm2
cmpq $24, %rax
movaps %xmm1, %xmm0
addsd %xmm2, %xmm0
jne .LBB2_2
ret
Intel's icc 13.01 (-O3 flag)
counted(double, Test&):
addsd 16(%rdi), %xmm0 #24.5
addsd 24(%rdi), %xmm0 #24.5
addsd 32(%rdi), %xmm0 #24.5
addsd 40(%rdi), %xmm0 #24.5
ret #25.10
iterated(double, Test&):
movq (%rdi), %rdx #30.26
movq 8(%rdi), %rcx #30.41
cmpq %rcx, %rdx #30.41
je ..B3.25 # Prob 50% #30.41
subq %rdx, %rcx #30.7
movb $0, %r8b #30.7
lea 7(%rcx), %rax #30.7
sarq $2, %rax #30.7
shrq $61, %rax #30.7
lea 7(%rax,%rcx), %rcx #30.7
sarq $3, %rcx #30.7
cmpq $16, %rcx #30.7
jl ..B3.26 # Prob 10% #30.7
movq %rdx, %rdi #30.7
andq $15, %rdi #30.7
je ..B3.6 # Prob 50% #30.7
testq $7, %rdi #30.7
jne ..B3.26 # Prob 10% #30.7
movl $1, %edi #30.7
..B3.6: # Preds ..B3.5 ..B3.3
lea 16(%rdi), %rax #30.7
cmpq %rax, %rcx #30.7
jl ..B3.26 # Prob 10% #30.7
movq %rcx, %rax #30.7
xorl %esi, %esi #30.7
subq %rdi, %rax #30.7
andq $15, %rax #30.7
negq %rax #30.7
addq %rcx, %rax #30.7
testq %rdi, %rdi #30.7
jbe ..B3.11 # Prob 2% #30.7
..B3.9: # Preds ..B3.7 ..B3.9
addsd (%rdx,%rsi,8), %xmm0 #31.9
incq %rsi #30.7
cmpq %rdi, %rsi #30.7
jb ..B3.9 # Prob 82% #30.7
..B3.11: # Preds ..B3.9 ..B3.7
pxor %xmm6, %xmm6 #28.12
movaps %xmm6, %xmm7 #28.12
movaps %xmm6, %xmm5 #28.12
movsd %xmm0, %xmm7 #28.12
movaps %xmm6, %xmm4 #28.12
movaps %xmm6, %xmm3 #28.12
movaps %xmm6, %xmm2 #28.12
movaps %xmm6, %xmm1 #28.12
movaps %xmm6, %xmm0 #28.12
..B3.12: # Preds ..B3.12 ..B3.11
addpd (%rdx,%rdi,8), %xmm7 #31.9
addpd 16(%rdx,%rdi,8), %xmm6 #31.9
addpd 32(%rdx,%rdi,8), %xmm5 #31.9
addpd 48(%rdx,%rdi,8), %xmm4 #31.9
addpd 64(%rdx,%rdi,8), %xmm3 #31.9
addpd 80(%rdx,%rdi,8), %xmm2 #31.9
addpd 96(%rdx,%rdi,8), %xmm1 #31.9
addpd 112(%rdx,%rdi,8), %xmm0 #31.9
addq $16, %rdi #30.7
cmpq %rax, %rdi #30.7
jb ..B3.12 # Prob 82% #30.7
addpd %xmm6, %xmm7 #28.12
addpd %xmm4, %xmm5 #28.12
addpd %xmm2, %xmm3 #28.12
addpd %xmm0, %xmm1 #28.12
addpd %xmm5, %xmm7 #28.12
addpd %xmm1, %xmm3 #28.12
addpd %xmm3, %xmm7 #28.12
movaps %xmm7, %xmm0 #28.12
unpckhpd %xmm7, %xmm0 #28.12
addsd %xmm0, %xmm7 #28.12
movaps %xmm7, %xmm0 #28.12
..B3.14: # Preds ..B3.13 ..B3.26
lea 1(%rax), %rsi #30.7
cmpq %rsi, %rcx #30.7
jb ..B3.25 # Prob 50% #30.7
subq %rax, %rcx #30.7
cmpb $1, %r8b #30.7
jne ..B3.17 # Prob 50% #30.7
..B3.16: # Preds ..B3.17 ..B3.15
xorl %r8d, %r8d #30.7
jmp ..B3.21 # Prob 100% #30.7
..B3.17: # Preds ..B3.15
cmpq $2, %rcx #30.7
jl ..B3.16 # Prob 10% #30.7
movq %rcx, %r8 #30.7
xorl %edi, %edi #30.7
pxor %xmm1, %xmm1 #28.12
lea (%rdx,%rax,8), %rsi #31.19
andq $-2, %r8 #30.7
movsd %xmm0, %xmm1 #28.12
..B3.19: # Preds ..B3.19 ..B3.18
addpd (%rsi,%rdi,8), %xmm1 #31.9
addq $2, %rdi #30.7
cmpq %r8, %rdi #30.7
jb ..B3.19 # Prob 82% #30.7
movaps %xmm1, %xmm0 #28.12
unpckhpd %xmm1, %xmm0 #28.12
addsd %xmm0, %xmm1 #28.12
movaps %xmm1, %xmm0 #28.12
..B3.21: # Preds ..B3.20 ..B3.16
cmpq %rcx, %r8 #30.7
jae ..B3.25 # Prob 2% #30.7
lea (%rdx,%rax,8), %rax #31.19
..B3.23: # Preds ..B3.23 ..B3.22
addsd (%rax,%r8,8), %xmm0 #31.9
incq %r8 #30.7
cmpq %rcx, %r8 #30.7
jb ..B3.23 # Prob 82% #30.7
..B3.25: # Preds ..B3.23 ..B3.21 ..B3.14 ..B3.1
ret #32.14
..B3.26: # Preds ..B3.2 ..B3.6 ..B3.4 # Infreq
movb $1, %r8b #30.7
xorl %eax, %eax #30.7
jmp ..B3.14 # Prob 100% #30.7
iteratedLocal(double):
lea -8(%rsp), %rax #8.13
lea -40(%rsp), %rdx #7.11
cmpq %rax, %rdx #33.41
je ..B4.15 # Prob 50% #33.41
movq %rax, -48(%rsp) #32.12
movq %rdx, -56(%rsp) #32.12
xorl %eax, %eax #33.7
..B4.13: # Preds ..B4.11 ..B4.13
addsd -40(%rsp,%rax,8), %xmm0 #34.9
incq %rax #33.7
cmpq $4, %rax #33.7
jb ..B4.13 # Prob 82% #33.7
..B4.15: # Preds ..B4.13 ..B4.1
ret #35.14
To avoid misunderstandings. If counted loop condition would rely on external parameter like this one.
double countedDep(double param, Test & d)
{
for (int i = 0; i < d.size; i++)
param += d.arr[i];
return param;
}
Such loop also will not be unrolled.