GCC generates undesired assembly code - c++

I'm working on vectorizing loops, and GCC is giving me a hard time.
When I look at the assembly code it generates, I see a lot of strange lines that I would like to get rid of.
For example, with vectorization, I've learnt that you can avoid a lot of extra assembly lines by giving additional information to GCC about array alignment.
http://locklessinc.com/articles/vectorize/
Here is my experiment.
#define SIZE 1024
void itwillwork (const uint16_t * a, const uint16_t * b, uint16_t * comp) {
int i = 0;
comp[i]=a[i]|b[i];
}
Generates simple assembly:
.globl _ZN8Test_LUT7performEv
23 _ZN8Test_LUT7performEv:
24 .LFB664:
25 .cfi_startproc
26 0020 488B4710 movq 16(%rdi), %rax
27 0024 488B4F08 movq 8(%rdi), %rcx
28 0028 488B5720 movq 32(%rdi), %rdx
29 002c 0FB700 movzwl (%rax), %eax
30 002f 660B01 orw (%rcx), %ax
31 0032 668902 movw %ax, (%rdx)
32 0035 C3 ret
33 .cfi_endproc
But, even though I was expecting a few extra lines, I am very surprised by what I got after adding a loop:
#define SIZE 1024
void itwillwork (const uint16_t * a, const uint16_t * b, uint16_t * comp) {
int i = 0;
for(i=0;i<SIZE;i++)
comp[i]=a[i]|b[i];
}
Generates this assembly with a lot more lines:
233 _Z10itwillworkPKtS0_Pt:
234 .LFB663:
235 .cfi_startproc
236 0250 488D4210 leaq 16(%rdx), %rax
237 0254 488D4E10 leaq 16(%rsi), %rcx
238 0258 4839F0 cmpq %rsi, %rax
239 025b 410F96C0 setbe %r8b
240 025f 4839CA cmpq %rcx, %rdx
241 0262 0F93C1 setnb %cl
242 0265 4108C8 orb %cl, %r8b
243 0268 743E je .L55
244 026a 4839F8 cmpq %rdi, %rax
245 026d 488D4710 leaq 16(%rdi), %rax
246 0271 0F96C1 setbe %cl
247 0274 4839C2 cmpq %rax, %rdx
248 0277 0F93C0 setnb %al
249 027a 08C1 orb %al, %cl
250 027c 742A je .L55
251 027e 31C0 xorl %eax, %eax
252 .p2align 4,,10
253 .p2align 3
254 .L57:
255 0280 F30F6F0C movdqu (%rsi,%rax), %xmm1
255 06
256 0285 F30F6F04 movdqu (%rdi,%rax), %xmm0
256 07
257 028a 660FEBC1 por %xmm1, %xmm0
258 028e F30F7F04 movdqu %xmm0, (%rdx,%rax)
258 02
259 0293 4883C010 addq $16, %rax
260 0297 483D0008 cmpq $2048, %rax
260 0000
261 029d 75E1 jne .L57
262 029f F3C3 rep ret
263 .p2align 4,,10
264 02a1 0F1F8000 .p2align 3
264 000000
265 .L55:
266 02a8 31C0 xorl %eax, %eax
267 02aa 660F1F44 .p2align 4,,10
267 0000
268 .p2align 3
269 .L58:
270 02b0 0FB70C06 movzwl (%rsi,%rax), %ecx
271 02b4 660B0C07 orw (%rdi,%rax), %cx
272 02b8 66890C02 movw %cx, (%rdx,%rax)
273 02bc 4883C002 addq $2, %rax
274 02c0 483D0008 cmpq $2048, %rax
274 0000
275 02c6 75E8 jne .L58
276 02c8 F3C3 rep ret
277 .cfi_endproc
Both were compiled with gcc 4.8.4 in release mode, -O2 -ftree-vectorize -msse2.
Can somebody help me get rid of those lines? Or, if it's impossible, can you tell me why they are there?
Update:
I've tried the tricks from http://locklessinc.com/articles/vectorize/, but I ran into another issue:
#define SIZE 1024
void itwillwork (const uint16_t * a, const uint16_t * b, uint16_t * comp) {
int i = 0;
for(i=0;i<SIZE;i++)
comp[i]=a[i]|b[i];
}
Only a few assembly lines are generated for this function itself; I get that.
But when I call this function from somewhere else:
itwillwork(a,b,c);
There is no call instruction: the long list of instructions from "itwillwork" (the same as above) appears inline at the call site.
Am I missing something? (The "extra lines" are the problem, not the inlined call.)

You are getting "weird" code because GCC cannot make assumptions about the alignment of your pointers so you can see that it is first performing an alignment test to determine whether it can take the fast path and do 128 bits at a time, or the slow path and do 16 bits at a time.
Additionally, the reason you are finding the code repeated is because the compiler is applying an inlining optimisation. You could disable this with the __attribute((noinline)) spec but if performance is your goal, let the compiler inline it.
If you specify the __restrict keyword then the compiler will only generate the fast-path code: https://goo.gl/g3jUfQ
However, this does not mean the compiler is going to magically take care of alignment for you so take care of what you pass to the function.
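For illustration, here is a sketch of how the function could be annotated so GCC can drop the overlap check and assume alignment. __restrict and __builtin_assume_aligned are GCC extensions; the 16-byte alignment is an assumption you must actually guarantee at every call site, otherwise the behaviour is undefined, and the exact assembly still depends on the GCC version and flags:
#include <stdint.h>
#include <stddef.h>

#define SIZE 1024

// Sketch: __restrict promises the three arrays never overlap, and
// __builtin_assume_aligned promises 16-byte alignment, so GCC only
// needs to emit the vectorized fast path.
void itwillwork(const uint16_t* __restrict a,
                const uint16_t* __restrict b,
                uint16_t* __restrict comp) {
    const uint16_t* pa = (const uint16_t*) __builtin_assume_aligned(a, 16);
    const uint16_t* pb = (const uint16_t*) __builtin_assume_aligned(b, 16);
    uint16_t* pc = (uint16_t*) __builtin_assume_aligned(comp, 16);
    for (size_t i = 0; i < SIZE; i++)
        pc[i] = pa[i] | pb[i];
}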

Related

Halide: X86 assembly code generation

I'm new in Halide. I am trying to compile camera_pipe application from the source code (https://github.com/halide/Halide/tree/master/apps/camera_pipe). I have successfully compiled camera_pipe.cpp. It generates "curved.s" assembly code.
# Lfunc_begin0:
.loc 3 12 0
#/data/nfs_home/akafi/Halide_CoreIR/src/runtime/posix_allocator.cpp:12:0
.cfi_startproc
#BB#0:
pushq %rbp
.Ltmp0:
.cfi_def_cfa_offset 16
.Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
.Ltmp2:
.cfi_def_cfa_register %rbp
#DEBUG_VALUE: default_malloc:user_context <- %RDI
#DEBUG_VALUE: default_malloc:x <- %RSI
.Ltmp3:
#DEBUG_VALUE: default_malloc:alignment <- 128
.loc 3 15 27 prologue_end
#/data/nfs_home/akafi/Halide_CoreIR/src/runtime/posix_allocator.cpp:15:27
subq $-128, %rsi
.Ltmp4:
.loc 3 15 18 is_stmt 0
# /data/nfs_home/akafi/Halide_CoreIR/src/runtime/posix_allocator.cpp:15:18
movq %rsi, %rdi
.Ltmp5:
callq malloc@PLT
movq %rax, %rcx
.Ltmp6:
#DEBUG_VALUE: default_malloc:orig <- %RCX
xorl %eax, %eax
.loc 3 16 14 is_stmt 1
# /data/nfs_home/akafi/Halide_CoreIR/src/runtime/posix_allocator.cpp:16:14
.Ltmp7:
testq %rcx, %rcx
je .LBB0_2
.Ltmp8:
# BB#1:
#DEBUG_VALUE: default_malloc:orig <- %RCX
.loc 3 21 68
# data/nfs_home/akafi/Halide_CoreIR/src/runtime/posix_allocator.cpp:21:68
movq %rcx, %rax
addq $135, %rax
......
......
I have tried to debug the source code. I found that the "camera_pipe.cpp" calls "/Halide_CoreIR/src/CodeGen_X86.cpp".
The generated assembly doesn't look like X86 assembly. Then what is the function of "CodeGen_X86.cpp"?
It sounds like you may be building from a very old Halide tree: for quite a while there has not been any file camera_pipe.cpp, the generated output is not called curved.*, etc.
That said, the x86 backend in CodeGen_X86.cpp does generate x86 code. The curved.s you posted is x86_64 assembly.

Why does this C++ function produce so many branch mispredictions?

Let A be an array of zeros and ones whose size is odd. If n is the size of A, then A is constructed such that the first ceil(n/2) elements are 0 and the remaining elements are 1.
So if n = 9, A would look like this:
0,0,0,0,0,1,1,1,1
The goal is to find the sum of 1s in the array and we do this by using this function:
s = 0;
void test1(int curIndex){
//A is 0,0,0,...,0,1,1,1,1,1...,1
if(curIndex == ceil(n/2)) return;
if(A[curIndex] == 1) return;
test1(curIndex+1);
test1(size-curIndex-1);
s += A[curIndex+1] + A[size-curIndex-1];
}
This function is rather silly for the problem given, but it simulates a different function that I want to have this shape, and it produces the same amount of branch mispredictions.
Here is the entire code of the experiment:
#include <iostream>
#include <fstream>
using namespace std;
int size;
int *A;
int half;
int s;
void test1(int curIndex){
//A is 0,0,0,...,0,1,1,1,1,1...,1
if(curIndex == half) return;
if(A[curIndex] == 1) return;
test1(curIndex+1);
test1(size - curIndex - 1);
s += A[curIndex+1] + A[size-curIndex-1];
}
int main(int argc, char* argv[]){
size = atoi(argv[1]);
if(argc!=2){
cout<<"type ./executable size{odd integer}"<<endl;
return 1;
}
if(size%2!=1){
cout<<"size must be an odd number"<<endl;
return 1;
}
A = new int[size];
half = size/2;
int i;
for(i=0;i<=half;i++){
A[i] = 0;
}
for(i=half+1;i<size;i++){
A[i] = 1;
}
for(i=0;i<100;i++) {
test1(0);
}
cout<<s<<endl;
return 0;
}
Compile by typing g++ -O3 -std=c++11 file.cpp and run by typing ./executable size{odd integer}.
I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB.
Running perf stat -B -e branches,branch-misses ./cachetests 111111 gives me the following:
Performance counter stats for './cachetests 111111':
32,639,932 branches
1,404,836 branch-misses # 4.30% of all branches
0.060349641 seconds time elapsed
if I remove the line
s += A[curIndex+1] + A[size-curIndex-1];
I get the following output from perf:
Performance counter stats for './cachetests 111111':
24,079,109 branches
39,078 branch-misses # 0.16% of all branches
0.027679521 seconds time elapsed
What does that line have to do with branch predictions when it's not even an if statement?
The way I see it, in the first ceil(n/2) - 1 calls of test1(), both if statements will be false. In the ceil(n/2)-th call, if(curIndex == ceil(n/2)) will be true. In the remaining n-ceil(n/2) calls, the first statement will be false, and the second statement will be true.
Why does Intel fail to predict such a simple behavior?
Now let's look at a second case. Suppose that A now has alternating zeros and ones. We will always start from 0. So if n = 9 A will look like this:
0,1,0,1,0,1,0,1,0
The function we are going to use is the following:
void test2(int curIndex){
//A is 0,1,0,1,0,1,0,1,....
if(curIndex == size-1) return;
if(A[curIndex] == 1) return;
test2(curIndex+1);
test2(curIndex+2);
s += A[curIndex+1] + A[curIndex+2];
}
And here is the entire code of the experiment:
#include <iostream>
#include <fstream>
using namespace std;
int size;
int *A;
int s;
void test2(int curIndex){
//A is 0,1,0,1,0,1,0,1,....
if(curIndex == size-1) return;
if(A[curIndex] == 1) return;
test2(curIndex+1);
test2(curIndex+2);
s += A[curIndex+1] + A[curIndex+2];
}
int main(int argc, char* argv[]){
size = atoi(argv[1]);
if(argc!=2){
cout<<"type ./executable size{odd integer}"<<endl;
return 1;
}
if(size%2!=1){
cout<<"size must be an odd number"<<endl;
return 1;
}
A = new int[size];
int i;
for(i=0;i<size;i++){
if(i%2==0){
A[i] = false;
}
else{
A[i] = true;
}
}
for(i=0;i<100;i++) {
test2(0);
}
cout<<s<<endl;
return 0;
}
I run perf using the same commands as before:
Performance counter stats for './cachetests2 111111':
28,560,183 branches
54,204 branch-misses # 0.19% of all branches
0.037134196 seconds time elapsed
And removing that line again improved things a little bit:
Performance counter stats for './cachetests2 111111':
28,419,557 branches
16,636 branch-misses # 0.06% of all branches
0.009977772 seconds time elapsed
Now if we analyse the function, if(curIndex == size-1) will be false n-1 times, and if(A[curIndex] == 1) will alternate from true to false.
As I see it, both functions should be easy to predict, however this is not the case for the first function. At the same time I am not sure what is happening with that line and why it plays a role in improving branch behavior.
Here are my thoughts on this after staring at it for a while. First of all,
the issue is easily reproducible with -O2, so it's better to use that as a
reference, as it generates simple non-unrolled code that is easy to
analyse. The problem with -O3 is essentially the same, it's just a bit less obvious.
So, for the first case (half-zeros with half-ones pattern) the compiler
generates this code:
0000000000400a80 <_Z5test1i>:
400a80: 55 push %rbp
400a81: 53 push %rbx
400a82: 89 fb mov %edi,%ebx
400a84: 48 83 ec 08 sub $0x8,%rsp
400a88: 3b 3d 0e 07 20 00 cmp 0x20070e(%rip),%edi # 60119c <half>
400a8e: 74 4f je 400adf <_Z5test1i+0x5f>
400a90: 48 8b 15 09 07 20 00 mov 0x200709(%rip),%rdx # 6011a0 <A>
400a97: 48 63 c7 movslq %edi,%rax
400a9a: 48 8d 2c 85 00 00 00 lea 0x0(,%rax,4),%rbp
400aa1: 00
400aa2: 83 3c 82 01 cmpl $0x1,(%rdx,%rax,4)
400aa6: 74 37 je 400adf <_Z5test1i+0x5f>
400aa8: 8d 7f 01 lea 0x1(%rdi),%edi
400aab: e8 d0 ff ff ff callq 400a80 <_Z5test1i>
400ab0: 89 df mov %ebx,%edi
400ab2: f7 d7 not %edi
400ab4: 03 3d ee 06 20 00 add 0x2006ee(%rip),%edi # 6011a8 <size>
400aba: e8 c1 ff ff ff callq 400a80 <_Z5test1i>
400abf: 8b 05 e3 06 20 00 mov 0x2006e3(%rip),%eax # 6011a8 <size>
400ac5: 48 8b 15 d4 06 20 00 mov 0x2006d4(%rip),%rdx # 6011a0 <A>
400acc: 29 d8 sub %ebx,%eax
400ace: 48 63 c8 movslq %eax,%rcx
400ad1: 8b 44 2a 04 mov 0x4(%rdx,%rbp,1),%eax
400ad5: 03 44 8a fc add -0x4(%rdx,%rcx,4),%eax
400ad9: 01 05 b9 06 20 00 add %eax,0x2006b9(%rip) # 601198 <s>
400adf: 48 83 c4 08 add $0x8,%rsp
400ae3: 5b pop %rbx
400ae4: 5d pop %rbp
400ae5: c3 retq
400ae6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
400aed: 00 00 00
Very simple, kind of what you would expect -- two conditional branches, two
calls. It gives us this (or similar) statistics on Core 2 Duo T6570, AMD
Phenom II X4 925 and Core i7-4770:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500
Performance counter stats for './a.out 111111':
45,216,754 branches
5,588,484 branch-misses # 12.36% of all branches
0.098535791 seconds time elapsed
If you make this change, moving the assignment before the recursive calls:
--- file.cpp.orig 2016-09-22 22:59:20.744678438 +0300
+++ file.cpp 2016-09-22 22:59:36.492583925 +0300
@@ -15,10 +15,10 @@
if(curIndex == half) return;
if(A[curIndex] == 1) return;
+ s += A[curIndex+1] + A[size-curIndex-1];
test1(curIndex+1);
test1(size - curIndex - 1);
- s += A[curIndex+1] + A[size-curIndex-1];
}
The picture changes:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500
Performance counter stats for './a.out 111111':
39,495,804 branches
54,430 branch-misses # 0.14% of all branches
0.039522259 seconds time elapsed
And yes, as was already noted it's directly related to tail recursion
optimisation, because if you compile the patched code with
-fno-optimize-sibling-calls you will get the same "bad" results. So let's
look at what we have in assembly with tail call optimization:
0000000000400a80 <_Z5test1i>:
400a80: 3b 3d 16 07 20 00 cmp 0x200716(%rip),%edi # 60119c <half>
400a86: 53 push %rbx
400a87: 89 fb mov %edi,%ebx
400a89: 74 5f je 400aea <_Z5test1i+0x6a>
400a8b: 48 8b 05 0e 07 20 00 mov 0x20070e(%rip),%rax # 6011a0 <A>
400a92: 48 63 d7 movslq %edi,%rdx
400a95: 83 3c 90 01 cmpl $0x1,(%rax,%rdx,4)
400a99: 74 4f je 400aea <_Z5test1i+0x6a>
400a9b: 8b 0d 07 07 20 00 mov 0x200707(%rip),%ecx # 6011a8 <size>
400aa1: eb 15 jmp 400ab8 <_Z5test1i+0x38>
400aa3: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400aa8: 48 8b 05 f1 06 20 00 mov 0x2006f1(%rip),%rax # 6011a0 <A>
400aaf: 48 63 d3 movslq %ebx,%rdx
400ab2: 83 3c 90 01 cmpl $0x1,(%rax,%rdx,4)
400ab6: 74 32 je 400aea <_Z5test1i+0x6a>
400ab8: 29 d9 sub %ebx,%ecx
400aba: 8d 7b 01 lea 0x1(%rbx),%edi
400abd: 8b 54 90 04 mov 0x4(%rax,%rdx,4),%edx
400ac1: 48 63 c9 movslq %ecx,%rcx
400ac4: 03 54 88 fc add -0x4(%rax,%rcx,4),%edx
400ac8: 01 15 ca 06 20 00 add %edx,0x2006ca(%rip) # 601198 <s>
400ace: e8 ad ff ff ff callq 400a80 <_Z5test1i>
400ad3: 8b 0d cf 06 20 00 mov 0x2006cf(%rip),%ecx # 6011a8 <size>
400ad9: 89 c8 mov %ecx,%eax
400adb: 29 d8 sub %ebx,%eax
400add: 89 c3 mov %eax,%ebx
400adf: 83 eb 01 sub $0x1,%ebx
400ae2: 39 1d b4 06 20 00 cmp %ebx,0x2006b4(%rip) # 60119c <half>
400ae8: 75 be jne 400aa8 <_Z5test1i+0x28>
400aea: 5b pop %rbx
400aeb: c3 retq
400aec: 0f 1f 40 00 nopl 0x0(%rax)
It has four conditional branches with one call. So let's analyse the data
we've got so far.
First of all, what is a branching instruction from the processor perspective? It's any of call, ret, j* (including direct jmp) and loop. call and jmp are a bit unintuitive, but they are crucial to count things correctly.
Overall, we expect this function to be called 11111100 times, one for each
element, that's roughly 11M. In non-tail-call-optimized version we see about
45M branches, initialization in main() is just 111K, all the other things are minor, so the main contribution to this number comes from our function. Our function is call-ed, it evaluates the first je, which is true in all cases except one, then it evaluates the second je, which is true half of the time, and then it either calls itself recursively (but we've already counted that the function is invoked 11M times) or returns (as it does after the recursive calls). So that's 4 branching instructions per 11M calls, exactly the number we see. Out of these, around 5.5M branches are missed, which suggests that these misses all come from one mispredicted instruction, either something that's evaluated 11M times and missed around 50% of the time or something that's evaluated half of the time and always missed.
What do we have in tail-call-optimized version? We have the function called
around 5.5M times, but now each invocation incurs one call, two branches initially (first one is true in all cases except one and the second one is always false because of our data), then a jmp, then a call (but we've already counted that we have 5.5M calls), then a branch at 400ae8 and a branch at 400ab6 (always true because of our data), then return. So, on average that's four conditional branches, one unconditional jump, a call and one indirect branch (return from function), 5.5M times 7 gives us an overall count of around 39M branches, exactly as we see in the perf output.
What we know is that the processor has no problem at all predicting things in a flow with one function call (even though this version has more conditional branches) and it has problems with two function calls. So it suggests that the problem is in returns from the function.
Unfortunately, we know very little about the details of how exactly branch
predictors of our modern processors work. The best analysis that I could find
is this and it suggests that the processors have a return stack buffer of around 16 entries. If we're to return to our data again with this finding at hand things start to clarify a bit.
When you have half-zeroes with half-ones pattern, you're recursing very
deeply into test1(curIndex+1), but then you start returning back and
calling test1(size-curIndex-1). That recursion is never deeper than one
call, so the returns are predicted perfectly for it. But remember that we're
now 55555 invocations deep and the processor only remembers last 16, so it's
not surprising that it can't guess our returns starting from 55539-level deep,
it's more surprising that it can do so with tail-call-optimized version.
Actually, the behaviour of tail-call-optimized version suggests that missing
any other information about returns, the processor just assumes that the right
one is the last one seen. It's also proven by the behaviour of
non-tail-call-optimized version, because it goes 55555 calls deep into the
test1(curIndex+1) and then upon return it always gets one level deep into
test1(size-curIndex-1), so when we're up from 55555-deep to 55539-deep (or
whatever your processor return buffer is) it calls into
test1(size-curIndex-1), returns from that and it has absolutely no
information about the next return, so it assumes that we're to return to the
last seen address (which is the address to return to from
test1(size-curIndex-1)) and it's obviously wrong. 55539 times wrong. With
100 cycles of the function, that's exactly the 5.5M branch prediction misses
we see.
Now let's get to your alternating pattern and the code for that. This code is
actually very different, if you analyse how deep it goes. Here your
test2(curIndex+1) always returns immediately and your test2(curIndex+2)
always goes deeper. So the returns from
test2(curIndex+1) are always predicted perfectly (they just don't go deep
enough) and when we're to finish our recursion into test2(curIndex+2), it
always returns to the same point, all 55555 times, so the processor has no
problems with that.
This can further be proven by this little change to your original half-zeroes with half-ones code:
--- file.cpp.orig 2016-09-23 11:00:26.917977032 +0300
+++ file.cpp 2016-09-23 11:00:31.946027451 +0300
@@ -15,8 +15,8 @@
if(curIndex == half) return;
if(A[curIndex] == 1) return;
- test1(curIndex+1);
test1(size - curIndex - 1);
+ test1(curIndex+1);
s += A[curIndex+1] + A[size-curIndex-1];
So now the code generated is still not tail-call optimized (assembly-wise it's very similar to the original), but you get something like this in the perf output:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500
Performance counter stats for './a.out 111111':
45 308 579 branches
75 927 branch-misses # 0,17% of all branches
0,026271402 seconds time elapsed
As expected, now our first call always returns immediately and the second call goes 55555-deep and then only returns to the same point.
Now with that solved, let me show something up my sleeve. On one system,
namely a Core i5-5200U, the non-tail-call-optimized original half-zeroes with half-ones version shows these results:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500
Performance counter stats for './a.out 111111':
45 331 670 branches
16 349 branch-misses # 0,04% of all branches
0,043351547 seconds time elapsed
So, apparently, Broadwell can handle this pattern easily, which returns us to
the question of how much we know about the branch prediction logic of our
modern processors.
Removing the line s += A[curIndex+1] + A[size-curIndex-1]; enables tail recursive optimization.
This optimization can only happen when the recursive call is in the last line of the function.
https://en.wikipedia.org/wiki/Tail_call
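To make the effect concrete, here is a hand-written sketch (not compiler output; test1_tco is an illustrative name) of what tail-call optimization effectively does to test1 once the accumulation has been moved before the calls or removed: the trailing recursive call becomes a jump back to the top of the function, so only the first call still grows the stack.
// Sketch of the tail-call-optimized shape of test1; the globals
// half, A, size and s are the ones from the question's program.
void test1_tco(int curIndex) {
    while (true) {
        if (curIndex == half) return;
        if (A[curIndex] == 1) return;
        s += A[curIndex + 1] + A[size - curIndex - 1]; // moved up, as in the earlier patch
        test1_tco(curIndex + 1);          // still a real call
        curIndex = size - curIndex - 1;   // former tail call, now just a loop
    }
}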
Interestingly, in the first execution you have about 30% more branches than in the second execution (32M branches vs 24M branches).
I have generated the assembly code for your application using gcc 4.8.5 and the same flags (plus -S) and there is a significant difference between the assemblies. The code with the conflicting statement is about 572 lines while the code without that statement is only 409 lines. Focusing on the symbol _Z5test1i (the mangled C++ name for test1), the routine is 367 lines long in the first case while in the second case it occupies only 202 lines. From all those lines, the first case contains 36 branches (plus 15 call instructions) and the second case contains 34 branches (plus 1 call instruction).
It is also interesting that compiling the application with -O1 does not expose this divergence between the two versions (although the branch mispredict is higher, approx 12%). Using -O2 shows a difference between the two versions (12% vs 3% of branch mispredicts).
I'm not enough of a compiler expert to understand the control flow and logic used by the compiler, but it looks like the compiler is able to achieve smarter optimizations (maybe including tail recursive optimization, as pointed out by user1850903 in his answer) when that portion of the code is not present.
The problem is this:
if(A[curIndex] == 1) return;
Each call of the test function alternates the result of this comparison, due to some optimizations, since the array is, for example, 0,0,0,0,0,1,1,1,1.
In other words:
curIndex = 0 -> A[0] = 0
test1(curIndex + 1) -> curIndex = 1 -> A[1] = 0
But then, the processor architecture MIGHT (a big might, because it depends; for me that optimization is disabled - an i5-6400) have a feature called runahead (performed alongside branch prediction), which executes the remaining instructions in the pipeline before entering a branch; so it will execute test1(size - curIndex - 1) before the offending if statement.
When the assignment is removed, another optimization kicks in, as user1850903 said.
The following piece of code is tail-recursive: the last line of the function doesn't require a call, simply a branch to the point where the function begins using the first argument:
void f(int i) {
if (i == size) return;
s += a[i];
f(i + 1);
}
However, if we break this and make it non-tail recursive:
void f(int i) {
if (i == size) return;
f(i + 1);
s += a[i];
}
There are a number of reasons why the compiler can't deduce the latter to be tail-recursive, but in the example you've given,
test(A[N]);
test(A[M]);
s += a[N] + a[M];
the same rules apply. The compiler can't determine that this is tail recursive, but more so it can't do it because of the two calls (see before and after).
What you appear to be expecting the compiler to produce is a function which performs a couple of simple conditional branches, two calls and some loads/adds/stores.
Instead, the compiler is unrolling this recursion and generating code which has a lot of branch points. This is done partly because the compiler believes it will be more efficient this way (involving fewer branches) but partly because it decreases the runtime recursion depth.
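To see what that unrolling means in source terms, here is a hand-written sketch (not compiler output) of the recursion inlined just one level deep; the real -O3 output below corresponds to several such levels folded together. The globals are the ones from the question, and the function name is illustrative. The actual test source and its generated assembly follow.
// Sketch: test1 with its first recursive call inlined one level.
void test1_unrolled_once(int curIndex) {
    if (curIndex == half || A[curIndex] == 1) return;
    // inlined body of test1(curIndex + 1):
    if (curIndex + 1 != half && A[curIndex + 1] != 1) {
        test1(curIndex + 2);
        test1(size - curIndex - 2);
        s += A[curIndex + 2] + A[size - curIndex - 2];
    }
    test1(size - curIndex - 1);
    s += A[curIndex + 1] + A[size - curIndex - 1];
}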
int size;
int* A;
int half;
int s;
void test1(int curIndex){
if(curIndex == half || A[curIndex] == 1) return;
test1(curIndex+1);
test1(size-curIndex-1);
s += A[curIndex+1] + A[size-curIndex-1];
}
produces:
test1(int):
movl half(%rip), %edx
cmpl %edi, %edx
je .L36
pushq %r15
pushq %r14
movslq %edi, %rcx
pushq %r13
pushq %r12
leaq 0(,%rcx,4), %r12
pushq %rbp
pushq %rbx
subq $24, %rsp
movq A(%rip), %rax
cmpl $1, (%rax,%rcx,4)
je .L1
leal 1(%rdi), %r13d
movl %edi, %ebp
cmpl %r13d, %edx
je .L42
cmpl $1, 4(%rax,%r12)
je .L42
leal 2(%rdi), %ebx
cmpl %ebx, %edx
je .L39
cmpl $1, 8(%rax,%r12)
je .L39
leal 3(%rdi), %r14d
cmpl %r14d, %edx
je .L37
cmpl $1, 12(%rax,%r12)
je .L37
leal 4(%rdi), %edi
call test1(int)
movl %r14d, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movl %ecx, %esi
movl 16(%rax,%r12), %edx
subl %r14d, %esi
movslq %esi, %rsi
addl -4(%rax,%rsi,4), %edx
addl %edx, s(%rip)
movl half(%rip), %edx
.L10:
movl %ecx, %edi
subl %ebx, %edi
leal -1(%rdi), %r14d
cmpl %edx, %r14d
je .L38
movslq %r14d, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r15
je .L38
call test1(int)
movl %r14d, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movl %ecx, %edx
movl 4(%rax,%r15), %esi
movl %ecx, %edi
subl %r14d, %edx
subl %ebx, %edi
movslq %edx, %rdx
addl -4(%rax,%rdx,4), %esi
movl half(%rip), %edx
addl s(%rip), %esi
movl %esi, s(%rip)
.L13:
movslq %edi, %rdi
movl 12(%rax,%r12), %r8d
addl -4(%rax,%rdi,4), %r8d
addl %r8d, %esi
movl %esi, s(%rip)
.L7:
movl %ecx, %ebx
subl %r13d, %ebx
leal -1(%rbx), %r14d
cmpl %edx, %r14d
je .L41
movslq %r14d, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r15
je .L41
cmpl %edx, %ebx
je .L18
movslq %ebx, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r8
movq %r8, (%rsp)
je .L18
leal 1(%rbx), %edi
call test1(int)
movl %ebx, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movq (%rsp), %r8
movl %ecx, %esi
subl %ebx, %esi
movl 4(%rax,%r8), %edx
movslq %esi, %rsi
addl -4(%rax,%rsi,4), %edx
addl %edx, s(%rip)
movl half(%rip), %edx
.L18:
movl %ecx, %edi
subl %r14d, %edi
leal -1(%rdi), %ebx
cmpl %edx, %ebx
je .L40
movslq %ebx, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r8
je .L40
movq %r8, (%rsp)
call test1(int)
movl %ebx, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movq (%rsp), %r8
movl %ecx, %edx
movl %ecx, %edi
subl %ebx, %edx
movl 4(%rax,%r8), %esi
subl %r14d, %edi
movslq %edx, %rdx
addl -4(%rax,%rdx,4), %esi
movl half(%rip), %edx
addl s(%rip), %esi
movl %esi, %r8d
movl %esi, s(%rip)
.L20:
movslq %edi, %rdi
movl 4(%rax,%r15), %esi
movl %ecx, %ebx
addl -4(%rax,%rdi,4), %esi
subl %r13d, %ebx
addl %r8d, %esi
movl %esi, s(%rip)
.L16:
movslq %ebx, %rbx
movl 8(%rax,%r12), %edi
addl -4(%rax,%rbx,4), %edi
addl %edi, %esi
movl %esi, s(%rip)
jmp .L4
.L45:
movl s(%rip), %edx
.L23:
movslq %ebx, %rbx
movl 4(%rax,%r12), %ecx
addl -4(%rax,%rbx,4), %ecx
addl %ecx, %edx
movl %edx, s(%rip)
.L1:
addq $24, %rsp
popq %rbx
popq %rbp
popq %r12
popq %r13
popq %r14
popq %r15
.L36:
rep ret
.L42:
movl size(%rip), %ecx
.L4:
movl %ecx, %ebx
subl %ebp, %ebx
leal -1(%rbx), %r14d
cmpl %edx, %r14d
je .L45
movslq %r14d, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r15
je .L45
cmpl %edx, %ebx
je .L25
movslq %ebx, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r13
je .L25
leal 1(%rbx), %esi
cmpl %edx, %esi
movl %esi, (%rsp)
je .L26
cmpl $1, 8(%rax,%r15)
je .L26
leal 2(%rbx), %edi
call test1(int)
movl (%rsp), %esi
movl %esi, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movl (%rsp), %esi
movq A(%rip), %rax
movl %ecx, %edx
subl %esi, %edx
movslq %edx, %rsi
movl 12(%rax,%r15), %edx
addl -4(%rax,%rsi,4), %edx
addl %edx, s(%rip)
movl half(%rip), %edx
.L26:
movl %ecx, %edi
subl %ebx, %edi
leal -1(%rdi), %esi
cmpl %edx, %esi
je .L43
movslq %esi, %r8
cmpl $1, (%rax,%r8,4)
leaq 0(,%r8,4), %r9
je .L43
movq %r9, 8(%rsp)
movl %esi, (%rsp)
call test1(int)
movl (%rsp), %esi
movl %esi, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movl (%rsp), %esi
movq A(%rip), %rax
movq 8(%rsp), %r9
movl %ecx, %edx
movl %ecx, %edi
subl %esi, %edx
movl 4(%rax,%r9), %esi
subl %ebx, %edi
movslq %edx, %rdx
addl -4(%rax,%rdx,4), %esi
movl half(%rip), %edx
addl s(%rip), %esi
movl %esi, s(%rip)
.L28:
movslq %edi, %rdi
movl 4(%rax,%r13), %r8d
addl -4(%rax,%rdi,4), %r8d
addl %r8d, %esi
movl %esi, s(%rip)
.L25:
movl %ecx, %r13d
subl %r14d, %r13d
leal -1(%r13), %ebx
cmpl %edx, %ebx
je .L44
movslq %ebx, %rdi
cmpl $1, (%rax,%rdi,4)
leaq 0(,%rdi,4), %rsi
movq %rsi, (%rsp)
je .L44
cmpl %edx, %r13d
je .L33
movslq %r13d, %rdx
cmpl $1, (%rax,%rdx,4)
leaq 0(,%rdx,4), %r8
movq %r8, 8(%rsp)
je .L33
leal 1(%r13), %edi
call test1(int)
movl %r13d, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rdi
movq 8(%rsp), %r8
movl %ecx, %edx
subl %r13d, %edx
movl 4(%rdi,%r8), %eax
movslq %edx, %rdx
addl -4(%rdi,%rdx,4), %eax
addl %eax, s(%rip)
.L33:
subl %ebx, %ecx
leal -1(%rcx), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movl %ecx, %esi
movl %ecx, %r13d
subl %ebx, %esi
movq (%rsp), %rbx
subl %r14d, %r13d
movslq %esi, %rsi
movl 4(%rax,%rbx), %edx
addl -4(%rax,%rsi,4), %edx
movl s(%rip), %esi
addl %edx, %esi
movl %esi, s(%rip)
.L31:
movslq %r13d, %r13
movl 4(%rax,%r15), %edx
subl %ebp, %ecx
addl -4(%rax,%r13,4), %edx
movl %ecx, %ebx
addl %esi, %edx
movl %edx, s(%rip)
jmp .L23
.L44:
movl s(%rip), %esi
jmp .L31
.L39:
movl size(%rip), %ecx
jmp .L7
.L41:
movl s(%rip), %esi
jmp .L16
.L43:
movl s(%rip), %esi
jmp .L28
.L38:
movl s(%rip), %esi
jmp .L13
.L37:
movl size(%rip), %ecx
jmp .L10
.L40:
movl s(%rip), %r8d
jmp .L20
s:
half:
.zero 4
A:
.zero 8
size:
.zero 4
For the alternating values case, assuming size == 7:
test1(curIndex = 0)
{
if (curIndex == size - 1) return; // false x1
if (A[curIndex] == 1) return; // false x1
test1(curIndex + 1 => 1) {
if (curIndex == size - 1) return; // false x2
if (A[curIndex] == 1) return; // false x1 -mispred-> returns
}
test1(curIndex + 2 => 2) {
if (curIndex == size - 1) return; // false x 3
if (A[curIndex] == 1) return; // false x2
test1(curIndex + 1 => 3) {
if (curIndex == size - 1) return; // false x3
if (A[curIndex] == 1) return; // false x2 -mispred-> returns
}
test1(curIndex + 2 => 4) {
if (curIndex == size - 1) return; // false x4
if (A[curIndex] == 1) return; // false x3
test1(curIndex + 1 => 5) {
if (curIndex == size - 1) return; // false x5
if (A[curIndex] == 1) return; // false x3 -mispred-> returns
}
test1(curIndex + 2 => 6) {
if (curIndex == size - 1) return; // false x5 -mispred-> returns
}
s += A[5] + A[6];
}
s += A[3] + A[4];
}
s += A[1] + A[2];
}
And let's imagine a case where
size = 11;
A[11] = { 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0 };
test1(0)
-> test1(1)
-> test1(2)
-> test1(3) -> returns because 1
-> test1(4)
-> test1(5)
-> test1(6)
-> test1(7) -- returns because 1
-> test1(8)
-> test1(9) -- returns because 1
-> test1(10) -- returns because size-1
-> test1(7) -- returns because 1
-> test1(6)
-> test1(7)
-> test1(8)
-> test1(9) -- 1
-> test1(10) -- size-1
-> test1(3) -> returns
-> test1(2)
... as above
or
size = 5;
A[5] = { 0, 0, 0, 0, 1 };
test1(0)
-> test1(1)
-> test1(2)
-> test1(3)
-> test1(4) -- size
-> test1(5) -- UB
-> test1(4)
-> test1(3)
-> test1(4) -- size
-> test1(5) -- UB
-> test1(2)
..
The two cases you've singled out (alternating and half-pattern) are optimal extremes and the compiler has picked some intermediate case that it will try to handle best.

Add up all elements of compile-time sized array most efficiently

I'm trying to efficiently add everything up in a compile-time sized array, using the least amount of instructions. Naturally I'm using templates. I created this:
template<unsigned int startIndex, unsigned int count>
int AddCollapseArray(int theArray[])
{
if(count == 1)
{
return theArray[startIndex];
}
else if(count == 2)
{
return theArray[startIndex] + theArray[startIndex + 1];
}
else if(count % 2 == 0)
{
return AddCollapseArray<startIndex, count / 2>(theArray) + AddCollapseArray<startIndex + count / 2, count / 2>(theArray);
}
else if (count % 2 == 1)
{
int newCount = count-1;
return AddCollapseArray<startIndex, newCount / 2>(theArray) + AddCollapseArray<startIndex + newCount / 2, newCount / 2>(theArray) + theArray[startIndex + newCount];
}
}
It looks to me like this will get the job done most efficiently. I think the branching and the arithmetic besides the additions will be completely optimized out. Are there any flaws with doing it this way?
Don't try to outsmart the optimizer. All this complicated template machinery just makes it harder for the optimizer to understand what you actually want to do.
For example,
int f0(int *p) {
return AddCollapseArray<0, 10>(p);
}
int f1(int *p) {
return std::accumulate(p+0, p+10, 0);
}
Produces the exact same assembly with clang at -O3
f0(int*): # #f0(int*)
movl 4(%rdi), %eax
addl (%rdi), %eax
addl 8(%rdi), %eax
addl 12(%rdi), %eax
addl 16(%rdi), %eax
addl 20(%rdi), %eax
addl 24(%rdi), %eax
addl 28(%rdi), %eax
addl 32(%rdi), %eax
addl 36(%rdi), %eax
retq
f1(int*): # #f1(int*)
movl 4(%rdi), %eax
addl (%rdi), %eax
addl 8(%rdi), %eax
addl 12(%rdi), %eax
addl 16(%rdi), %eax
addl 20(%rdi), %eax
addl 24(%rdi), %eax
addl 28(%rdi), %eax
addl 32(%rdi), %eax
addl 36(%rdi), %eax
retq
Let's say we want to do 100 elements:
int f0(int *p) {
return AddCollapseArray<0, 100>(p);
}
int f1(int *p) {
return std::accumulate(p+0, p+100, 0);
}
Here's what we get:
f0(int*): # #f0(int*)
pushq %rbp
pushq %rbx
pushq %rax
movq %rdi, %rbx
callq int AddCollapseArray<0u, 50u>(int*)
movl %eax, %ebp
movq %rbx, %rdi
callq int AddCollapseArray<50u, 50u>(int*)
addl %ebp, %eax
addq $8, %rsp
popq %rbx
popq %rbp
retq
f1(int*): # #f1(int*)
movdqu (%rdi), %xmm0
movdqu 16(%rdi), %xmm1
movdqu 32(%rdi), %xmm2
movdqu 48(%rdi), %xmm3
paddd %xmm0, %xmm1
paddd %xmm2, %xmm1
paddd %xmm3, %xmm1
movdqu 64(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 80(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 96(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 112(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 128(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 144(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 160(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 176(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 192(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 208(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 224(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 240(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 256(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 272(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 288(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 304(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 320(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 336(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 352(%rdi), %xmm0
paddd %xmm1, %xmm0
movdqu 368(%rdi), %xmm1
paddd %xmm0, %xmm1
movdqu 384(%rdi), %xmm0
paddd %xmm1, %xmm0
pshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
paddd %xmm0, %xmm1
pshufd $229, %xmm1, %xmm0 # xmm0 = xmm1[1,1,2,3]
paddd %xmm1, %xmm0
movd %xmm0, %eax
retq
int AddCollapseArray<0u, 50u>(int*): # #int AddCollapseArray<0u, 50u>(int*)
movl 4(%rdi), %eax
addl (%rdi), %eax
addl 8(%rdi), %eax
addl 12(%rdi), %eax
addl 16(%rdi), %eax
addl 20(%rdi), %eax
addl 24(%rdi), %eax
addl 28(%rdi), %eax
addl 32(%rdi), %eax
addl 36(%rdi), %eax
addl 40(%rdi), %eax
addl 44(%rdi), %eax
addl 48(%rdi), %eax
addl 52(%rdi), %eax
addl 56(%rdi), %eax
addl 60(%rdi), %eax
addl 64(%rdi), %eax
addl 68(%rdi), %eax
addl 72(%rdi), %eax
addl 76(%rdi), %eax
addl 80(%rdi), %eax
addl 84(%rdi), %eax
addl 88(%rdi), %eax
addl 92(%rdi), %eax
addl 96(%rdi), %eax
addl 100(%rdi), %eax
addl 104(%rdi), %eax
addl 108(%rdi), %eax
addl 112(%rdi), %eax
addl 116(%rdi), %eax
addl 120(%rdi), %eax
addl 124(%rdi), %eax
addl 128(%rdi), %eax
addl 132(%rdi), %eax
addl 136(%rdi), %eax
addl 140(%rdi), %eax
addl 144(%rdi), %eax
addl 148(%rdi), %eax
addl 152(%rdi), %eax
addl 156(%rdi), %eax
addl 160(%rdi), %eax
addl 164(%rdi), %eax
addl 168(%rdi), %eax
addl 172(%rdi), %eax
addl 176(%rdi), %eax
addl 180(%rdi), %eax
addl 184(%rdi), %eax
addl 188(%rdi), %eax
addl 192(%rdi), %eax
addl 196(%rdi), %eax
retq
int AddCollapseArray<50u, 50u>(int*): # #int AddCollapseArray<50u, 50u>(int*)
movl 204(%rdi), %eax
addl 200(%rdi), %eax
addl 208(%rdi), %eax
addl 212(%rdi), %eax
addl 216(%rdi), %eax
addl 220(%rdi), %eax
addl 224(%rdi), %eax
addl 228(%rdi), %eax
addl 232(%rdi), %eax
addl 236(%rdi), %eax
addl 240(%rdi), %eax
addl 244(%rdi), %eax
addl 248(%rdi), %eax
addl 252(%rdi), %eax
addl 256(%rdi), %eax
addl 260(%rdi), %eax
addl 264(%rdi), %eax
addl 268(%rdi), %eax
addl 272(%rdi), %eax
addl 276(%rdi), %eax
addl 280(%rdi), %eax
addl 284(%rdi), %eax
addl 288(%rdi), %eax
addl 292(%rdi), %eax
addl 296(%rdi), %eax
addl 300(%rdi), %eax
addl 304(%rdi), %eax
addl 308(%rdi), %eax
addl 312(%rdi), %eax
addl 316(%rdi), %eax
addl 320(%rdi), %eax
addl 324(%rdi), %eax
addl 328(%rdi), %eax
addl 332(%rdi), %eax
addl 336(%rdi), %eax
addl 340(%rdi), %eax
addl 344(%rdi), %eax
addl 348(%rdi), %eax
addl 352(%rdi), %eax
addl 356(%rdi), %eax
addl 360(%rdi), %eax
addl 364(%rdi), %eax
addl 368(%rdi), %eax
addl 372(%rdi), %eax
addl 376(%rdi), %eax
addl 380(%rdi), %eax
addl 384(%rdi), %eax
addl 388(%rdi), %eax
addl 392(%rdi), %eax
addl 396(%rdi), %eax
retq
Not only is your function not fully inlined, it's also not vectorized. GCC produces similar results.
The important qualifier here is the meaning of "least number of instructions". If that is to be interpreted as causing the CPU to perform the fewest steps, and we further stipulate there are no advanced techniques to be employed, like SIMD, GPU programming or OMP (or other auto parallel technologies)....just C or C++, then consider:
Assuming something like:
int a[ 10 ];
Which is filled with data at runtime, and will always contain 10 entries (0 through 9)
The std::accumulate does a nice job here, creating a tight loop in the assembler, no mess...just quick:
int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );
Of course, some const int signifying the size of the array 'a' would be in order.
This compares curiously to:
for( int n=0; n < 10; ++n ) r += a[ n ];
The compiler very smartly emits 10 add instructions unrolled - it doesn't even bother with a loop.
Now, this means that in std::accumulate, though the loop is tight, there will be, at the minimum, two add instructions for each element (one for the sum, and one to increment the iterator). Add to that the comparison instruction and a conditional jump, and there are at least 4 instructions per item, or about 40 machine language steps of various cost in ticks.
On the other hand, the unrolled result of the for loop is just 10 machine steps, which the CPU can very likely schedule with great cache friendliness, and no jumps.
The for loop is definitely faster.
The compiler "knows" what you're trying to do, and gets to the job as well as you might think through it with the proposed code you posted.
Further, if the size of the array gets too outlandish for unrolling the loop, the compiler automatically performs the classic optimization that std::accumulate does not appear to do for some reason...i.e., performing two additions per loop iteration (when it constructs a loop because of the number of elements).
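For illustration, a hand-rolled version of that "two additions per loop" idea might look like the sketch below (the function name and shape are mine, not from the post); two independent accumulators shorten the loop-carried dependency chain:
// Sketch: manual two-way unrolling with two accumulators.
int sum_two_accumulators(const int* a, int n) {
    int s0 = 0, s1 = 0;
    int i = 0;
    for (; i + 1 < n; i += 2) {   // two elements per iteration
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)                    // odd-length tail
        s0 += a[i];
    return s0 + s1;
}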
Using VC 2012, this source:
int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );
int z = 0;
int *ap = a;
int *ae = &a[9];
while( ap <= ae ) { z += *ap; ++ap; }
int z2 = 0;
for (int n=0; n < 10; ++n ) z2 += a[ n ];
Produces the following assembler snippets on a release build in VC2012
int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );
00301270 33 D2 xor edx,edx
00301272 B8 D4 40 30 00 mov eax,3040D4h
00301277 EB 07 jmp wmain+10h (0301280h)
00301279 8D A4 24 00 00 00 00 lea esp,[esp]
00301280 03 10 add edx,dword ptr [eax]
00301282 83 C0 04 add eax,4
00301285 3D F8 40 30 00 cmp eax,3040F8h
0030128A 75 F4 jne wmain+10h (0301280h)
while( ap <= ae ) { z += *ap; ++ap; }
003012A0 03 08 add ecx,dword ptr [eax]
003012A2 03 70 04 add esi,dword ptr [eax+4]
003012A5 83 C0 08 add eax,8
003012A8 3D F4 40 30 00 cmp eax,3040F4h
003012AD 7E F1 jle wmain+30h (03012A0h)
003012AF 3D F8 40 30 00 cmp eax,3040F8h
003012B4 77 02 ja wmain+48h (03012B8h)
003012B6 8B 38 mov edi,dword ptr [eax]
003012B8 8D 04 0E lea eax,[esi+ecx]
003012BB 03 F8 add edi,eax
for (int n=0; n < 10; ++n ) z2 += a[ n ];
003012BD A1 D4 40 30 00 mov eax,dword ptr ds:[003040D4h]
003012C2 03 05 F8 40 30 00 add eax,dword ptr ds:[3040F8h]
003012C8 03 05 D8 40 30 00 add eax,dword ptr ds:[3040D8h]
003012CE 03 05 DC 40 30 00 add eax,dword ptr ds:[3040DCh]
003012D4 03 05 E0 40 30 00 add eax,dword ptr ds:[3040E0h]
003012DA 03 05 E4 40 30 00 add eax,dword ptr ds:[3040E4h]
003012E0 03 05 E8 40 30 00 add eax,dword ptr ds:[3040E8h]
003012E6 03 05 EC 40 30 00 add eax,dword ptr ds:[3040ECh]
003012EC 03 05 F0 40 30 00 add eax,dword ptr ds:[3040F0h]
003012F2 03 05 F4 40 30 00 add eax,dword ptr ds:[3040F4h]
Based on comments I decided to try this in XCode 7, with drastically different results. This is its unrolling of the for loop:
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a(%rip), %rax
Ltmp22:
##DEBUG_VALUE: do3:z2 <- EAX
movq %rax, %rcx
shrq $32, %rcx
.loc 1 58 33 is_stmt 0 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
addl %eax, %ecx
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a+8(%rip), %rax
Ltmp23:
.loc 1 58 33 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
movl %eax, %edx
addl %ecx, %edx
shrq $32, %rax
addl %edx, %eax
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a+16(%rip), %rcx
.loc 1 58 33 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
movl %ecx, %edx
addl %eax, %edx
shrq $32, %rcx
addl %edx, %ecx
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a+24(%rip), %rax
.loc 1 58 33 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
movl %eax, %edx
addl %ecx, %edx
shrq $32, %rax
addl %edx, %eax
.loc 1 58 36 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
movq _a+32(%rip), %rcx
.loc 1 58 33 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
movl %ecx, %edx
addl %eax, %edx
shrq $32, %rcx
addl %edx, %ecx
This may not look as clean as VC's simple list, but it may run as fast because the setup (movq or movl) for each addition may run in parallel in the CPU as the previous entry is finishing its addition, costing little to nothing by comparison to VC's simple, clean-looking series of adds on memory sources.
The following is Xcode's std::accumulate. It SEEMS there's an init required, but then it performs a clean series of additions, having unrolled the loop, which VC did not do.
.file 37 "/Applications/Xcode7.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1" "numeric"
.loc 37 75 27 is_stmt 1 ## /Applications/Xcode7.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/numeric:75:27
movq _a(%rip), %r14
Ltmp11:
movq %r14, -48(%rbp) ## 8-byte Spill
Ltmp12:
shrq $32, %r14
movq _a+8(%rip), %rbx
movq %rbx, -56(%rbp) ## 8-byte Spill
shrq $32, %rbx
movq _a+16(%rip), %r13
movq %r13, -72(%rbp) ## 8-byte Spill
shrq $32, %r13
movq _a+24(%rip), %r15
movq %r15, %r12
shrq $32, %r12
Ltmp13:
movl _a+32(%rip), %eax
Ltmp14:
movq -48(%rbp), %rax ## 8-byte Reload
addl %eax, %r14d
movq -56(%rbp), %rax ## 8-byte Reload
addl %eax, %r14d
addl %ebx, %r14d
movq -72(%rbp), %rax ## 8-byte Reload
addl %eax, %r14d
addl %r13d, %r14d
addl %r15d, %r14d
addl %r12d, %r14d
addl -64(%rbp), %r14d ## 4-byte Folded Reload
The bottom line here is that the optimizations we rely upon from compilers differ so widely and wildly from one compiler to another that we should rely upon them, but keep watch.
LLVM is quite exemplary, and understands std::accumulate better than VC, it seems - but this short investigation can't reveal if that is a difference in the implementation of the library or of the compiler. There could be important differences in the implementation of Xcode's std::accumulate which give the compiler more insight than VC's version of the library.
That applies more generally to algorithms, even those from numeric. std::accumulate is a for loop. It is likely expanded inline as a for loop based on pointers into the array, which is why VC's choice to create a loop for std::accumulate was echoed in its choice to produce a loop for the code using int * to loop through the array, while it unrolled the loop for the for loop using an integer to reference entries in the array by index. In other words, it really did no better in a straight for loop when pointers were used, and that suggests it's VC's optimizer, not the library, in this case.
This follows Stroustrup's own favorite example of the idea of information available to the compiler, comparing qsort from C and sort from C++. qsort takes a function pointer to perform the comparison, cutting the compiler off from understanding the comparison and forcing it to call a function via a pointer. The C++ sort function, on the other hand, takes a functor, which conveys more information about the comparison. That could still result in a function call, but the optimizer has the opportunity to understand the comparison sufficiently to make it inline.
In VC's case, for whatever reason (we'd have to ask Microsoft), the compiler is confused when looping through the array via pointers. The information given to it is different than with the loop using an integer to index the array. It understands the integer-indexed form, but not the pointers. LLVM, by contrast, understood both (and more). The difference of information is not important to LLVM, but it is to VC. Since std::accumulate is really an inline representing a for loop, and that loop is processed via pointers, it escaped VC's recognition, just as VC missed it in the straight for loop based on pointers. If a specialization could be made for integer arrays, such that accumulate looped with indexes rather than pointers, VC would respond with better output, but it shouldn't be necessary.
A poor optimizer can miss the point, and a poor implementation of the library could confuse the optimizer, which means that under the best circumstances std::accumulate can perform about as well as the for loop for a simple array of integers, producing an unrolled version of the loop creating the sum, but not always. However, there's little to get in the way of the compiler's understanding in a for loop...everything is right there, and the implementation of the library can't mess it up; it's all up to the compiler at that point. For that, VC shows its weakness.
I tried all settings on VC to try to get it to unroll std::accumulate, but so far it never did (haven't tried newer versions of VC).
It didn't take much to get Xcode to unroll the loop; LLVM seems to have deeper engineering. It may have a better implementation of the library, too.
Incidentally, the C code example I posted at top was used in VC, which didn't recognize that the three different summations were related. LLVM on XCode did, which meant the first time I tried it there it simply adopted the answer from std::accumulate and did nothing otherwise. VC was really weak on that point. In order to get Xcode to perform 3 separate tests, I randomized the array before each call...otherwise Xcode realized what I was doing where VC did not.
Whereas std::accumulate should be enough, if you want to unroll the loop manually you may do:
namespace detail
{
template<std::size_t startIndex, std::size_t... Is>
int Accumulate(std::index_sequence<Is...>, const int a[])
{
int res = 0;
const int dummy[] = {0, ((res += a[startIndex + Is]), 0)...};
static_cast<void>(dummy); // Remove warning for unused variable
return res;
}
}
template<std::size_t startIndex, std::size_t count>
int AddCollapseArray(const int a[])
{
return detail::Accumulate<startIndex>(std::make_index_sequence<count>{}, a);
}
or in C++17, with fold expression:
namespace detail
{
template<std::size_t startIndex, std::size_t... Is>
int Accumulate(std::index_sequence<Is...>, const int a[])
{
return (a[startIndex + Is] + ...);
}
}
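For completeness, a small usage sketch (the array contents and printed value are illustrative; both versions above also need <utility> for std::index_sequence):
#include <utility>
#include <iostream>

int main() {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    // Sums a[2] .. a[6] with offsets fixed at compile time; the call
    // expands to straight-line loads and adds, no loop.
    std::cout << AddCollapseArray<2, 5>(a) << '\n';   // prints 25
}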

Able to change value of const in C, but not in C++

Consider the following code
#include <stdio.h>
#include <string.h>
int main()
{
const int a = 2;
long p = (long)&a;
int *c = (int *)p;
*c =3;
printf("%d", a);
}
This code can change the value of a in C but not in C++. I understand that C++ is applying optimization and replacing instances of a with 2. So was this a bug fix in C++ or was the bug fixed by chance due to optimization?
It's undefined behavior to modify a const value no matter directly or indirectly. This may compile in C and may even run without problem on your machine, but it's still undefined behavior.
The difference between C and C++ on this is: with const int a = 2, C++ treats a as a constant expression, for instance, you can use a as array dimension:
int n[a]; //fine in C++
But in C, a is not a constant expression, with the same code:
int n[a]; //VLA in C99
Here n is not a fixed-sized array, but a variable length array.
This is not a C vs C++ issue. By modifying a const value (as well as by double-casting a pointer via a long), you enter the realm of undefined behaviour in both languages. Therefore the difference is simply a matter of how the undefined behaviour chooses to manifest itself.
You are casting away the constness out of &a and modifying the pointed value, which is undefined behavior both in C and in C++ (the trip through long just adds some more gratuitous UB). In C++ your compiler happens to optimize more aggressively the constant, but the point of the situation is unchanged.
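As a side note, a minimal sketch of when casting away constness is and is not undefined (the names are illustrative, not from the question):
void const_cast_demo() {
    int x = 2;
    const int* pc = &x;
    *const_cast<int*>(pc) = 3;   // OK: the referenced object x is not const

    const int a = 2;
    *const_cast<int*>(&a) = 3;   // compiles, but undefined behaviour,
                                 // because a itself is declared const
}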
Your code generates undefined behavior in C++ since you're accessing memory you shouldn't:
#include <stdio.h>
#include <string.h>
void main()
{
const int a = 2;
printf("%x != %x !!", sizeof(long), sizeof(void*)); // on a x64 system 4 != 8
long p = (long)&a;
int *c = (int *)p;
*c =3;
printf("%d", a);
}
and even if it works on a 32 bit system modifying const memory by casting away the constness is undefined behavior in both languages.
The following is the assembly code generated by g++. The compiler statically uses "$2" instead of "a", but gcc (compiling the same code as C) doesn't perform this static optimization. I guess there shouldn't be any undefined behaviour.
.Ltext0:
.section .rodata
.LC0:
0000 256400 .string "%d"
.text
.globl main
main:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
.cfi_lsda 0x3,.LLSDA0
0000 55 pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
0001 4889E5 movq %rsp, %rbp
.cfi_def_cfa_register 6
0004 4883EC20 subq $32, %rsp
.LBB2:
0008 C745EC02 movl $2, -20(%rbp)
000000
000f 488D45EC leaq -20(%rbp), %rax
0013 488945F0 movq %rax, -16(%rbp)
0017 488B45F0 movq -16(%rbp), %rax
001b 488945F8 movq %rax, -8(%rbp)
001f 488B45F8 movq -8(%rbp), %rax
0023 C7000300 movl $3, (%rax)
0000
0029 488B45F8 movq -8(%rbp), %rax
002d 8B00 movl (%rax), %eax
002f 89C6 movl %eax, %esi
0031 BF000000 movl $.LC0, %edi
00
0036 B8000000 movl $0, %eax
00
.LEHB0:
003b E8000000 call printf
00
0040 BE020000 movl $2, %esi
00
0045 BF000000 movl $.LC0, %edi
00
004a B8000000 movl $0, %eax
00
004f E8000000 call printf
00
.LEHE0:
0054 B8000000 movl $0, %eax
00
0059 EB08 jmp .L5
.L4:
005b 4889C7 movq %rax, %rdi
.LEHB1:
005e E8000000 call _Unwind_Resume
00
.LEHE1:
.L5:
.LBE2:
0063 C9 leave
.cfi_def_cfa 7, 8
0064 C3 ret
.cfi_endproc
.LFE0:
.globl __gxx_personality_v0
.section .gcc_except_table,"a",@progbits
.LLSDA0:
0000 FF .byte 0xff
0001 FF .byte 0xff
0002 01 .byte 0x1
0003 08 .uleb128 .LLSDACSE0-.LLSDACSB0
.LLSDACSB0:
0004 3B .uleb128 .LEHB0-.LFB0
0005 19 .uleb128 .LEHE0-.LEHB0
0006 5B .uleb128 .L4-.LFB0
0007 00 .uleb128 0
0008 5E .uleb128 .LEHB1-.LFB0
0009 05 .uleb128 .LEHE1-.LEHB1
000a 00 .uleb128 0
000b 00 .uleb128 0
.LLSDACSE0:
.text
.Letext0:

Efficient data structure(s) for kind of scripting application?

In C++ I want to write an application that works similarly to a scripting language:
From some input during "setup time" it will define, in a big global array, where each variable will be located and, in a different array, the sequence of functions ("LogicElements") to call (including their parameters, i.e. which variables to use).
One implementation might look like:
class LogicElement_Generic
{
public:
virtual void calc() const = 0;
};
class LogicElement_Mul : public LogicElement_Generic
{
int &to;
const int &from1;
const int &from2;
public:
LogicElement_Mul( int &_to, const int &_from1, const int &_from2 ) : to(_to), from1(_from1), from2(_from2)
{}
void calc() const
{
to = from1 * from2;
}
};
char globalVariableBuffer[1000]; // a simple binary buffer
LogicElement_Generic *le[10];
int main( void )
{
// just a demo, this would be setup from e.g. an input file:
int *to = (int*)globalVariableBuffer;
int *from1 = (int*)(globalVariableBuffer + sizeof(int));
int *from2 = (int*)(globalVariableBuffer + 2*sizeof(int));
*from1 = 2;
*from2 = 3;
le[0] = new LogicElement_Mul( *to, *from1, *from2 );
// doing all calculations:
// finally it would be a loop iterating over all calculation functions,
// over and over again - the area in the code where all the resources
// would be burned...
le[0]->calc();
return *to;
}
Although that works as intended, looking at the created assembly:
78 .section .text._ZNK16LogicElement_Mul4calcEv,"axG",@progbits,_ZNK16LogicElement_Mul4calcEv,comdat
79 .align 2
80 .weak _ZNK16LogicElement_Mul4calcEv
82 _ZNK16LogicElement_Mul4calcEv:
83 .LFB6:
17:.../src/test.cpp **** void calc() const
84 .loc 1 17 0
85 .cfi_startproc
86 0000 55 pushq %rbp
87 .LCFI6:
88 .cfi_def_cfa_offset 16
89 .cfi_offset 6, -16
90 0001 4889E5 movq %rsp, %rbp
91 .LCFI7:
92 .cfi_def_cfa_register 6
93 0004 48897DF8 movq %rdi, -8(%rbp)
18:.../src/test.cpp **** {
19:.../src/test.cpp **** to = from1 * from2;
94 .loc 1 19 0
95 0008 488B45F8 movq -8(%rbp), %rax
96 000c 488B4008 movq 8(%rax), %rax
97 0010 488B55F8 movq -8(%rbp), %rdx
98 0014 488B5210 movq 16(%rdx), %rdx
99 0018 8B0A movl (%rdx), %ecx
100 001a 488B55F8 movq -8(%rbp), %rdx
101 001e 488B5218 movq 24(%rdx), %rdx
102 0022 8B12 movl (%rdx), %edx
103 0024 0FAFD1 imull %ecx, %edx
104 0027 8910 movl %edx, (%rax)
20:.../src/test.cpp **** }
105 .loc 1 20 0
106 0029 5D popq %rbp
107 .LCFI8:
108 .cfi_def_cfa 7, 8
109 002a C3 ret
110 .cfi_endproc
Looking at the assembly lines 95 .. 104 you can see that for each variable three indirections are used.
As this part of the code (the calc() methods) will eventually be called very frequently, I want to use as few CPU cycles and as little memory bandwidth as possible (in plain C/C++).
I also want (not shown in the code above) to have two variable buffers with exactly the same layout, to be able to do double buffering in a multithreaded approach and limit the necessary locks (the exact implementation details would be too much for this question).
So the big questions are:
How can I change the architecture to reduce the amount of memory indirections in the calc()?
(I'd expect only two: one to get the offset address in the variable array and an additional to get the variable itself - but my experiments changing the code above to use offsets made things far worse!)
Is there a better way to set up the classes and thus the array of the LogicElements so that calling the calculation methods will use the least amount of resources?
Thanks to the hint by @Ed S. I changed away from references (where I hoped the compiler could optimize better).
But an even more important step was to compare the assembly that was generated after activating optimizations (a simple -O2 did the job).
(I didn't do that at the beginning as I wanted to have a clearer picture on the generated "pure" machine code and not one where an intelligent compiler fixes a stupid programmer - but it seems the compiler is too "stupid" then...)
So the current result is quite good now for the variable array:
class LogicElement_Generic
{
public:
virtual void calc(void * const base) const = 0;
};
class LogicElement_Mul : public LogicElement_Generic
{
int const to;
int const from1;
int const from2;
public:
LogicElement_Mul( int const _to, int const _from1, int const _from2 ) : to(_to), from1(_from1), from2(_from2)
{}
void calc(void * const base) const
{
*((int*)(base+to)) = *((int*)(base+from1)) * *((int*)(base+from2));
}
};
char globalVariableBuffer[1000]; // a simple binary buffer
LogicElement_Generic *le[10];
int main( void )
{
int to = 0;
int from1 = sizeof(int);
int from2 = 2*sizeof(int);
*((int*)(globalVariableBuffer+from1)) = 2;
*((int*)(globalVariableBuffer+from2)) = 3;
le[0] = new LogicElement_Mul( to, from1, from2 );
le[0]->calc(globalVariableBuffer);
return *((int*)(globalVariableBuffer+to));
}
with the relevant part of the assembly:
17:.../src/test.cpp **** void calc(void * const base) const
12 .loc 1 17 0
13 .cfi_startproc
14 .LVL0:
18:.../src/test.cpp **** {
19:.../src/test.cpp **** *((int*)(base+to)) = *((int*)(base+from1)) * *((int*)(base+from2));
15 .loc 1 19 0
16 0000 4863470C movslq 12(%rdi), %rax
17 0004 48634F10 movslq 16(%rdi), %rcx
18 0008 48635708 movslq 8(%rdi), %rdx
19 000c 8B0406 movl (%rsi,%rax), %eax
20 000f 0FAF040E imull (%rsi,%rcx), %eax
21 0013 890416 movl %eax, (%rsi,%rdx)
20:.../src/test.cpp **** }
22 .loc 1 20 0
23 0016 C3 ret
24 .cfi_endproc
So I reckon the first question is answered! :)
The second is still open.
(Even more now as the pointer arithmetic might be valid C++ - but very ugly...)
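As a closing illustration of the double-buffering goal mentioned above, here is a minimal sketch (names and sizes are illustrative, not from the original code): because calc() now takes the buffer base as a parameter, the same LogicElement array can be run against either of two identically laid-out buffers.
char bufferA[1000];
char bufferB[1000];

// Run every configured LogicElement against the buffer given by 'base'.
// The offsets stored inside each element are valid for both buffers
// because the two buffers share the same layout.
void calcAll(LogicElement_Generic* const* elements, unsigned count, void* base)
{
    for (unsigned i = 0; i < count; ++i)
        elements[i]->calc(base);
}

// e.g. one thread prepares bufferB while another runs calcAll(le, 1, bufferA),
// then the two buffers swap roles under a single lock.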