Basic block in LLVM IR seems broken in assembly codes - llvm

I attach debug info to the first and last instruction of every basic block with an LLVM pass, and I can find the info I added in the generated assembly. However, the number of annotations on first instructions does not match the number of annotations on last instructions.
I want to know whether this result is correct, and if it is, why the counts differ and how to correctly recover the boundaries of a basic block in the assembly code.
The code of my pass
bool runOnFunction(Function &F) override {
  // Sentinel line numbers used to mark block boundaries.
  unsigned start = 2333;
  unsigned end = 23333;
  MDNode *N = F.getMetadata("dbg");
  for (BasicBlock &B : F) {
    errs() << "Hello: ";
    // Tag the first non-PHI instruction of the block.
    Instruction *I = B.getFirstNonPHI();
    DebugLoc loc = DebugLoc::get(start, 0, N);
    if (loc && I != NULL) {
      I->setDebugLoc(loc);
    } else {
      errs() << "start error";
    }
    // Tag the block's terminator.
    I = B.getTerminator();
    loc = DebugLoc::get(end, 1, N);
    if (loc && I != NULL) {
      I->setDebugLoc(loc);
    } else {
      errs() << "end error";
    }
    errs() << "\n";
  }
  return true;
}
};
}
It runs without any errors. Here is part of the output:
# %bb.2: # %for.body
# in Loop: Header=BB0_1 Depth=1
.loc 1 2333 0 # myls.c:2333:0
movl -4(%ebp), %eax
.Ltmp4:
.loc 1 71 14 # myls.c:71:14
movl %eax, -8(%ebp)
.Ltmp5:
.LBB0_3: # %for.cond1
# Parent Loop BB0_1 Depth=1
# => This Inner Loop Header: Depth=2
.loc 1 2333 0 # myls.c:2333:0
movl -8(%ebp), %eax
.Ltmp6:
.loc 1 71 18 # myls.c:71:18
cmpl n, %eax
.Ltmp7:
.loc 1 23333 1 # myls.c:23333:1
jge .LBB0_13
......
LBB2_10: # %sw.epilog
.loc 1 23333 1 # myls.c:23333:1
jmp .LBB2_11
.LBB2_11: # %while.cond
# =>This Inner Loop Header: Depth=1
.loc 1 162 13 # myls.c:162:13
cmpl $0, -164(%ebp)
.loc 1 23333 1 # myls.c:23333:1
jl .LBB2_20
# %bb.12: # %while.body
# in Loop: Header=BB2_11 Depth=1
.Ltmp75:
.loc 1 164 14 # myls.c:164:14
movl -136(%ebp), %eax
.loc 1 164 28 is_stmt 0 # myls.c:164:28
movl -164(%ebp), %ecx
# kill: def $cl killed $ecx
.loc 1 164 25 # myls.c:164:25
movl $1, %edx
shll %cl, %edx
.loc 1 164 22 # myls.c:164:22
andl %edx, %eax
cmpl $0, %eax
.Ltmp76:
.loc 1 23333 1 is_stmt 1 # myls.c:23333:1
je .LBB2_18
I find that the 2333 and 23333 annotations don't pair up, and their counts differ in the assembly generated for different architectures. I use opt to run my pass and llc to produce the assembly.
I appreciate any help.

I think the problem is optimization. After your pass runs, the IR goes through further rounds of optimization that change the basic blocks and eliminate some instructions.
You probably have a registration function like the following one from the LLVM documentation:
static llvm::RegisterStandardPasses Y(
    llvm::PassManagerBuilder::EP_EarlyAsPossible,
    [](const llvm::PassManagerBuilder &Builder,
       llvm::legacy::PassManagerBase &PM) { PM.add(new Hello()); });
Try using EP_OptimizerLast rather than EP_EarlyAsPossible. That will run your pass after the last optimizer.
The other option is to use EP_EnabledOnOptLevel0 and run opt with -O0.
You can look up the extension points in this page.
It might also be helpful to use -emit-llvm to generate the LLVM IR as a *.ll file; it makes it much easier to see what your pass has actually done to the IR.
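For reference, here is a minimal sketch of the two registrations suggested above, assuming the legacy pass manager and the Hello pass class from the snippet earlier (not code from the original answer):
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"

// Run the pass after the optimizer pipeline, so later passes cannot
// move or drop the debug locations it sets.
static llvm::RegisterStandardPasses RegisterLast(
    llvm::PassManagerBuilder::EP_OptimizerLast,
    [](const llvm::PassManagerBuilder &,
       llvm::legacy::PassManagerBase &PM) { PM.add(new Hello()); });

// Also register for -O0 pipelines (e.g. opt -O0), where the
// EP_OptimizerLast extension point is not invoked.
static llvm::RegisterStandardPasses RegisterO0(
    llvm::PassManagerBuilder::EP_EnabledOnOptLevel0,
    [](const llvm::PassManagerBuilder &,
       llvm::legacy::PassManagerBase &PM) { PM.add(new Hello()); });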

Related

Am I correctly reasoning about cache performance?

I'm trying to solidify my understanding of data contention and have come up with the following minimal test program. It runs a thread that does some data crunching, and spinlocks on an atomic bool until the thread is done.
#include <iostream>
#include <atomic>
#include <thread>
#include <array>

class SideProcessor
{
public:
    SideProcessor()
        : dataArr{}
    {
    };
    void start() {
        processingRequested = true;
    };
    bool isStarted() {
        return processingRequested;
    };
    bool isDone() {
        return !processingRequested;
    };
    void finish() {
        processingRequested = false;
    }
    void run(std::atomic<bool>* exitRequested) {
        while(!(*exitRequested)) {
            // Spin on the flag.
            while(!(isStarted()) && !(*exitRequested)) {
            }
            if (*exitRequested) {
                // If we were asked to spin down, break out of the loop.
                break;
            }
            // Process
            processData();
            // Flag that we're done.
            finish();
        }
    };
private:
    std::atomic<bool> processingRequested;
#ifdef PAD_ALIGNMENT
    std::array<bool, 64> padding;
#endif
    std::array<int, 100> dataArr;
    void processData() {
        // Add 1 to every element a bunch of times.
        std::cout << "Processing data." << std::endl;
        for (unsigned int i = 0; i < 10000000; ++i) {
            for (unsigned int j = 0; j < 100; ++j) {
                dataArr[j] += 1;
            }
        }
        std::cout << "Done processing." << std::endl;
    };
};

int main() {
    std::atomic<bool> exitRequested;
    exitRequested = false;
    SideProcessor sideProcessor;
    std::thread processThreadObj = std::thread(&SideProcessor::run,
        &sideProcessor, &exitRequested);
    // Spinlock while the thread is doing work.
    std::cout << "Starting processing." << std::endl;
    sideProcessor.start();
    while (!(sideProcessor.isDone())) {
    }
    // Spin the thread down.
    std::cout << "Waiting for thread to spin down." << std::endl;
    exitRequested = true;
    processThreadObj.join();
    std::cout << "Done." << std::endl;
    return 0;
}
When I build with -O3 and don't define PAD_ALIGNMENT, I get the following results from Linux perf:
142,511,066 cache-references (29.15%)
142,364 cache-misses # 0.100 % of all cache refs (39.33%)
33,580,965 L1-dcache-load-misses # 3.40% of all L1-dcache hits (39.90%)
988,605,337 L1-dcache-loads (40.46%)
279,446,259 L1-dcache-store (40.71%)
227,713 L1-icache-load-misses (40.71%)
10,040,733 LLC-loads (40.71%)
5,834 LLC-load-misses # 0.06% of all LL-cache hits (40.32%)
40,070,067 LLC-stores (19.39%)
94 LLC-store-misses (19.22%)
0.708704757 seconds time elapsed
When I build with PAD_ALIGNMENT, I get the following results:
450,713 cache-references (27.83%)
124,281 cache-misses # 27.574 % of all cache refs (39.29%)
104,857 L1-dcache-load-misses # 0.01% of all L1-dcache hits (42.16%)
714,361,767 L1-dcache-loads (45.02%)
281,140,925 L1-dcache-store (45.83%)
90,839 L1-icache-load-misses (43.52%)
11,225 LLC-loads (40.66%)
3,685 LLC-load-misses # 32.83% of all LL-cache hits (37.80%)
1,798 LLC-stores (17.18%)
76 LLC-store-misses (17.18%)
0.140602005 seconds time elapsed
I have 2 questions:
Would I be correct in saying that the huge increase in cache references comes from missing in L1 and having to go to the LLC to get the cache line that the other core invalidated? (Please correct my terminology if it isn't accurate.)
I think I understand the increased LLC-loads, but why are there so many more LLC-stores when the data is on the same cache line?
(Edit) Additional information as requested:
g++ version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CPU model: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Kernel version: 4.15.0-106-generic
processData()'s loop when compiling with g++ -std=c++14 -pthread -g -O3 -c -fverbose-asm -Wa,-adhln:
Without PAD_ALIGNMENT:
57:main.cpp **** void processData() {
58:main.cpp **** // Add 1 to every element a bunch of times.
59:main.cpp **** std::cout << "Processing data." << std::endl;
60:main.cpp **** for (unsigned int i = 0; i < 10000000; ++i) {
61:main.cpp **** for (unsigned int j = 0; j < 100; ++j) {
62:main.cpp **** dataArr[j] += 1;
409 .loc 4 62 0
410 0109 83450401 addl $1, 4(%rbp) #, MEM[(value_type &)this_8(D) + 4]
411 .LVL32:
412 010d 4183FD01 cmpl $1, %r13d #, prolog_loop_niters.140
413 0111 0F841901 je .L29 #,
413 0000
414 0117 83450801 addl $1, 8(%rbp) #, MEM[(value_type &)this_8(D) + 4]
415 .LVL33:
416 011b 4183FD02 cmpl $2, %r13d #, prolog_loop_niters.140
417 011f 0F842801 je .L30 #,
417 0000
418 0125 83450C01 addl $1, 12(%rbp) #, MEM[(value_type &)this_8(D) + 4]
419 .LVL34:
420 0129 BF610000 movl $97, %edi #, ivtmp_12
420 00
421 # main.cpp:61: for (unsigned int j = 0; j < 100; ++j) {
61:main.cpp **** dataArr[j] += 1;
422 .loc 4 61 0
423 012e 41BA0300 movl $3, %r10d #, j
423 0000
424 .LVL35:
425 .L18:
426 0134 31C0 xorl %eax, %eax # ivtmp.156
427 0136 31D2 xorl %edx, %edx # ivtmp.153
428 0138 0F1F8400 .p2align 4,,10
428 00000000
429 .p2align 3
430 .L20:
431 0140 83C201 addl $1, %edx #, ivtmp.153
432 # main.cpp:62: dataArr[j] += 1;
433 .loc 4 62 0
434 0143 66410F6F movdqa (%r14,%rax), %xmm0 # MEM[base: vectp_this.148_118, index: ivtmp.156_48, offset: 0], vect__2
434 0406
435 0149 660FFEC1 paddd %xmm1, %xmm0 # tmp231, vect__25.150
436 014d 410F2904 movaps %xmm0, (%r14,%rax) # vect__25.150, MEM[base: vectp_this.148_118, index: ivtmp.156_48, offse
436 06
437 0152 4883C010 addq $16, %rax #, ivtmp.156
438 0156 4439FA cmpl %r15d, %edx # bnd.143, ivtmp.153
439 0159 72E5 jb .L20 #,
440 015b 4429E7 subl %r12d, %edi # niters_vector_mult_vf.144, ivtmp_12
441 015e 4439E3 cmpl %r12d, %ebx # niters_vector_mult_vf.144, niters.142
442 0161 438D0422 leal (%r10,%r12), %eax #, tmp.145
443 0165 89FA movl %edi, %edx # ivtmp_12, tmp.146
444 0167 7421 je .L21 #,
445 .LVL36:
446 0169 89C7 movl %eax, %edi # tmp.145, tmp.145
447 016b 8344BD04 addl $1, 4(%rbp,%rdi,4) #, MEM[(value_type &)_129 + 4]
447 01
448 # main.cpp:61: for (unsigned int j = 0; j < 100; ++j) {
61:main.cpp **** dataArr[j] += 1;
449 .loc 4 61 0
450 0170 83FA01 cmpl $1, %edx #, tmp.146
451 0173 8D7801 leal 1(%rax), %edi #,
452 .LVL37:
453 0176 7412 je .L21 #,
454 # main.cpp:62: dataArr[j] += 1;
455 .loc 4 62 0
456 0178 8344BD04 addl $1, 4(%rbp,%rdi,4) #, MEM[(value_type &)_122 + 4]
456 01
457 # main.cpp:61: for (unsigned int j = 0; j < 100; ++j) {
61:main.cpp **** dataArr[j] += 1;
458 .loc 4 61 0
459 017d 83C002 addl $2, %eax #,
460 .LVL38:
461 0180 83FA02 cmpl $2, %edx #, tmp.146
462 0183 7405 je .L21 #,
463 # main.cpp:62: dataArr[j] += 1;
464 .loc 4 62 0
465 0185 83448504 addl $1, 4(%rbp,%rax,4) #, MEM[(value_type &)_160 + 4]
465 01
466 .LVL39:
467 .L21:
468 .LBE930:
469 # main.cpp:60: for (unsigned int i = 0; i < 10000000; ++i) {
60:main.cpp **** for (unsigned int j = 0; j < 100; ++j) {
470 .loc 4 60 0
471 018a 83EE01 subl $1, %esi #, ivtmp_3
472 018d 0F856DFF jne .L22 #,
472 FFFF
With PAD_ALIGNMENT:
57:main.cpp **** void processData() {
58:main.cpp **** // Add 1 to every element a bunch of times.
59:main.cpp **** std::cout << "Processing data." << std::endl;
60:main.cpp **** for (unsigned int i = 0; i < 10000000; ++i) {
61:main.cpp **** for (unsigned int j = 0; j < 100; ++j) {
62:main.cpp **** dataArr[j] += 1;
410 .loc 4 62 0
411 0109 83454401 addl $1, 68(%rbp) #, MEM[(value_type &)this_8(D) + 68]
412 .LVL32:
413 010d 4183FD01 cmpl $1, %r13d #, prolog_loop_niters.140
414 0111 0F841901 je .L29 #,
414 0000
415 0117 83454801 addl $1, 72(%rbp) #, MEM[(value_type &)this_8(D) + 68]
416 .LVL33:
417 011b 4183FD02 cmpl $2, %r13d #, prolog_loop_niters.140
418 011f 0F842801 je .L30 #,
418 0000
419 0125 83454C01 addl $1, 76(%rbp) #, MEM[(value_type &)this_8(D) + 68]
420 .LVL34:
421 0129 BF610000 movl $97, %edi #, ivtmp_12
421 00
422 # main.cpp:61: for (unsigned int j = 0; j < 100; ++j) {
61:main.cpp **** dataArr[j] += 1;
423 .loc 4 61 0
424 012e 41BA0300 movl $3, %r10d #, j
424 0000
425 .LVL35:
426 .L18:
427 0134 31C0 xorl %eax, %eax # ivtmp.156
428 0136 31D2 xorl %edx, %edx # ivtmp.153
429 0138 0F1F8400 .p2align 4,,10
429 00000000
430 .p2align 3
431 .L20:
432 0140 83C201 addl $1, %edx #, ivtmp.153
433 # main.cpp:62: dataArr[j] += 1;
434 .loc 4 62 0
435 0143 66410F6F movdqa (%r14,%rax), %xmm0 # MEM[base: vectp_this.148_118, index: ivtmp.156_48, offset: 0], vect__2
435 0406
436 0149 660FFEC1 paddd %xmm1, %xmm0 # tmp231, vect__25.150
437 014d 410F2904 movaps %xmm0, (%r14,%rax) # vect__25.150, MEM[base: vectp_this.148_118, index: ivtmp.156_48, offse
437 06
438 0152 4883C010 addq $16, %rax #, ivtmp.156
439 0156 4439FA cmpl %r15d, %edx # bnd.143, ivtmp.153
440 0159 72E5 jb .L20 #,
441 015b 4429E7 subl %r12d, %edi # niters_vector_mult_vf.144, ivtmp_12
442 015e 4439E3 cmpl %r12d, %ebx # niters_vector_mult_vf.144, niters.142
443 0161 438D0422 leal (%r10,%r12), %eax #, tmp.145
444 0165 89FA movl %edi, %edx # ivtmp_12, tmp.146
445 0167 7421 je .L21 #,
446 .LVL36:
447 0169 89C7 movl %eax, %edi # tmp.145, tmp.145
448 016b 8344BD44 addl $1, 68(%rbp,%rdi,4) #, MEM[(value_type &)_129 + 68]
448 01
449 # main.cpp:61: for (unsigned int j = 0; j < 100; ++j) {
61:main.cpp **** dataArr[j] += 1;
450 .loc 4 61 0
451 0170 83FA01 cmpl $1, %edx #, tmp.146
452 0173 8D7801 leal 1(%rax), %edi #,
453 .LVL37:
454 0176 7412 je .L21 #,
455 # main.cpp:62: dataArr[j] += 1;
456 .loc 4 62 0
457 0178 8344BD44 addl $1, 68(%rbp,%rdi,4) #, MEM[(value_type &)_122 + 68]
457 01
458 # main.cpp:61: for (unsigned int j = 0; j < 100; ++j) {
61:main.cpp **** dataArr[j] += 1;
459 .loc 4 61 0
460 017d 83C002 addl $2, %eax #,
461 .LVL38:
462 0180 83FA02 cmpl $2, %edx #, tmp.146
463 0183 7405 je .L21 #,
464 # main.cpp:62: dataArr[j] += 1;
465 .loc 4 62 0
466 0185 83448544 addl $1, 68(%rbp,%rax,4) #, MEM[(value_type &)_160 + 68]
466 01
467 .LVL39:
468 .L21:
469 .LBE930:
470 # main.cpp:60: for (unsigned int i = 0; i < 10000000; ++i) {
60:main.cpp **** for (unsigned int j = 0; j < 100; ++j) {
471 .loc 4 60 0
472 018a 83EE01 subl $1, %esi #, ivtmp_3
473 018d 0F856DFF jne .L22 #,
473 FFFF
An instance of type SideProcessor has the following fields:
std::atomic<bool> processingRequested;
#ifdef PAD_ALIGNMENT
std::array<bool, 64> padding;
#endif
std::array<int, 100> dataArr;
The size of processingRequested is probably one byte. Without PAD_ALIGNMENT, the compiler will probably arrange the fields such that the first few elements of dataArr are in the same 64-byte cache line as processingRequested. However, with PAD_ALIGNMENT, there will be a 64-byte gap between the two fields, so the first element of the array and processingRequested will be in different cache lines.
Considering the loop in processData in isolation, one would expect all 100 elements of dataArr to fit easily in the L1D cache, so the vast majority of accesses should hit in the L1D. However, the main thread reads processingRequested in while (!(sideProcessor.isDone())) { } concurrently with the processing thread executing the loop in processData. Without PAD_ALIGNMENT, the main thread wants to read from the same cache line that the processing thread wants to both read and write. This results in a false sharing situation where the shared cache line repeatedly bounces between the private caches of the two cores on which the threads are running.
With false sharing between two cores in the same LLC sharing domain, there will be a negligible number of misses by the LLC (it can backstop the requests so they don't go to DRAM), but there will be a lot of read and RFO requests from the two cores. That's why the LLC miss event counts are small.
It appears to me that the compiler has unrolled the loop in processData four times and vectorized it using 16-byte SSE instructions. This would explain why the number of stores is close to a quarter of a billion. Without PAD_ALIGNMENT, the number of loads is about a billion; about a quarter of them come from the processing thread and most of the rest come from the main thread. The number of loads executed in while (!(sideProcessor.isDone())) { } depends on the time it takes processData to complete, so it makes sense that the number of loads is much smaller in the case with no false sharing (with PAD_ALIGNMENT).
In the case without PAD_ALIGNMENT, most of the L1-dcache-load-misses and LLC-loads events are from requests generated by the main thread, while most of the LLC-stores events are from requests generated by the processing thread. All of these requests are to the line containing processingRequested. It makes sense that LLC-stores is much larger than LLC-loads because the main thread accesses the line more rapidly than the processing thread, so it's more likely that the RFOs miss in the private caches of the core on which the processing thread is running. I also think most of the L1-dcache-load-misses events represent loads from the main thread to the shared cache line. It looks like only a third of those loads miss in the private L2 cache, which suggests that the line is being prefetched into the L2. This could be verified by disabling the L2 prefetchers and checking whether L1-dcache-load-misses is about equal to LLC-loads.
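As a side note (not part of the original question or answer): a common alternative to the manual std::array<bool, 64> padding is to put the flag and the data on separate cache lines explicitly with alignas. A minimal sketch, assuming C++17's std::hardware_destructive_interference_size is available (older standard libraries don't provide it, in which case a hard-coded 64 serves the same purpose):
#include <atomic>
#include <array>
#include <new>  // std::hardware_destructive_interference_size (C++17)

class SideProcessorPadded {
    // Each member starts on its own cache line, so the main thread's
    // polling loads on the flag never contend with the processing
    // thread's stores to dataArr.
    alignas(std::hardware_destructive_interference_size)
        std::atomic<bool> processingRequested{false};
    alignas(std::hardware_destructive_interference_size)
        std::array<int, 100> dataArr{};
};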

Why does this C++ function produce so many branch mispredictions?

Let A be an array of zeros and ones whose size n is odd. A is constructed such that the first ceil(n/2) elements are 0 and the remaining elements are 1.
So if n = 9, A would look like this:
0,0,0,0,0,1,1,1,1
The goal is to find the sum of 1s in the array and we do this by using this function:
s = 0;

void test1(int curIndex){
    //A is 0,0,0,...,0,1,1,1,1,1...,1
    if(curIndex == ceil(n/2)) return;
    if(A[curIndex] == 1) return;
    test1(curIndex+1);
    test1(size-curIndex-1);
    s += A[curIndex+1] + A[size-curIndex-1];
}
This function is rather silly for the problem given, but it simulates a different function that I want to look like this, and it produces the same amount of branch mispredictions.
Here is the entire code of the experiment:
#include <iostream>
#include <fstream>
using namespace std;
int size;
int *A;
int half;
int s;
void test1(int curIndex){
//A is 0,0,0,...,0,1,1,1,1,1...,1
if(curIndex == half) return;
if(A[curIndex] == 1) return;
test1(curIndex+1);
test1(size - curIndex - 1);
s += A[curIndex+1] + A[size-curIndex-1];
}
int main(int argc, char* argv[]){
size = atoi(argv[1]);
if(argc!=2){
cout<<"type ./executable size{odd integer}"<<endl;
return 1;
}
if(size%2!=1){
cout<<"size must be an odd number"<<endl;
return 1;
}
A = new int[size];
half = size/2;
int i;
for(i=0;i<=half;i++){
A[i] = 0;
}
for(i=half+1;i<size;i++){
A[i] = 1;
}
for(i=0;i<100;i++) {
test1(0);
}
cout<<s<<endl;
return 0;
}
Compile by typing g++ -O3 -std=c++11 file.cpp and run by typing ./executable size{odd integer}.
I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB.
Running perf stat -B -e branches,branch-misses ./cachetests 111111 gives me the following:
Performance counter stats for './cachetests 111111':
32,639,932 branches
1,404,836 branch-misses # 4.30% of all branches
0.060349641 seconds time elapsed
if I remove the line
s += A[curIndex+1] + A[size-curIndex-1];
I get the following output from perf:
Performance counter stats for './cachetests 111111':
24,079,109 branches
39,078 branch-misses # 0.16% of all branches
0.027679521 seconds time elapsed
What does that line have to do with branch predictions when it's not even an if statement?
The way I see it, in the first ceil(n/2) - 1 calls of test1(), both if statements will be false. In the ceil(n/2)-th call, if(curIndex == ceil(n/2)) will be true. In the remaining n-ceil(n/2) calls, the first statement will be false, and the second statement will be true.
Why does Intel fail to predict such a simple behavior?
Now let's look at a second case. Suppose that A now has alternating zeros and ones. We will always start from 0. So if n = 9 A will look like this:
0,1,0,1,0,1,0,1,0
The function we are going to use is the following:
void test2(int curIndex){
    //A is 0,1,0,1,0,1,0,1,....
    if(curIndex == size-1) return;
    if(A[curIndex] == 1) return;
    test2(curIndex+1);
    test2(curIndex+2);
    s += A[curIndex+1] + A[curIndex+2];
}
And here is the entire code of the experiment:
#include <iostream>
#include <fstream>
using namespace std;
int size;
int *A;
int s;
void test2(int curIndex){
//A is 0,1,0,1,0,1,0,1,....
if(curIndex == size-1) return;
if(A[curIndex] == 1) return;
test2(curIndex+1);
test2(curIndex+2);
s += A[curIndex+1] + A[curIndex+2];
}
int main(int argc, char* argv[]){
size = atoi(argv[1]);
if(argc!=2){
cout<<"type ./executable size{odd integer}"<<endl;
return 1;
}
if(size%2!=1){
cout<<"size must be an odd number"<<endl;
return 1;
}
A = new int[size];
int i;
for(i=0;i<size;i++){
if(i%2==0){
A[i] = false;
}
else{
A[i] = true;
}
}
for(i=0;i<100;i++) {
test2(0);
}
cout<<s<<endl;
return 0;
}
I run perf using the same commands as before:
Performance counter stats for './cachetests2 111111':
28,560,183 branches
54,204 branch-misses # 0.19% of all branches
0.037134196 seconds time elapsed
And removing that line again improved things a little bit:
Performance counter stats for './cachetests2 111111':
28,419,557 branches
16,636 branch-misses # 0.06% of all branches
0.009977772 seconds time elapsed
Now if we analyse the function, if(curIndex == size-1) will be false n-1 times, and if(A[curIndex] == 1) will alternate from true to false.
As I see it, both functions should be easy to predict, however this is not the case for the first function. At the same time I am not sure what is happening with that line and why it plays a role in improving branch behavior.
Here are my thoughts on this after staring at it for a while. First of all,
the issue is easily reproducible with -O2, so it's better to use that as a
reference, as it generates simple non-unrolled code that is easy to
analyse. The problem with -O3 is essentially the same, it's just a bit less obvious.
So, for the first case (half-zeros with half-ones pattern) the compiler
generates this code:
0000000000400a80 <_Z5test1i>:
400a80: 55 push %rbp
400a81: 53 push %rbx
400a82: 89 fb mov %edi,%ebx
400a84: 48 83 ec 08 sub $0x8,%rsp
400a88: 3b 3d 0e 07 20 00 cmp 0x20070e(%rip),%edi #
60119c <half>
400a8e: 74 4f je 400adf <_Z5test1i+0x5f>
400a90: 48 8b 15 09 07 20 00 mov 0x200709(%rip),%rdx #
6011a0 <A>
400a97: 48 63 c7 movslq %edi,%rax
400a9a: 48 8d 2c 85 00 00 00 lea 0x0(,%rax,4),%rbp
400aa1: 00
400aa2: 83 3c 82 01 cmpl $0x1,(%rdx,%rax,4)
400aa6: 74 37 je 400adf <_Z5test1i+0x5f>
400aa8: 8d 7f 01 lea 0x1(%rdi),%edi
400aab: e8 d0 ff ff ff callq 400a80 <_Z5test1i>
400ab0: 89 df mov %ebx,%edi
400ab2: f7 d7 not %edi
400ab4: 03 3d ee 06 20 00 add 0x2006ee(%rip),%edi #
6011a8 <size>
400aba: e8 c1 ff ff ff callq 400a80 <_Z5test1i>
400abf: 8b 05 e3 06 20 00 mov 0x2006e3(%rip),%eax #
6011a8 <size>
400ac5: 48 8b 15 d4 06 20 00 mov 0x2006d4(%rip),%rdx #
6011a0 <A>
400acc: 29 d8 sub %ebx,%eax
400ace: 48 63 c8 movslq %eax,%rcx
400ad1: 8b 44 2a 04 mov 0x4(%rdx,%rbp,1),%eax
400ad5: 03 44 8a fc add -0x4(%rdx,%rcx,4),%eax
400ad9: 01 05 b9 06 20 00 add %eax,0x2006b9(%rip) #
601198 <s>
400adf: 48 83 c4 08 add $0x8,%rsp
400ae3: 5b pop %rbx
400ae4: 5d pop %rbp
400ae5: c3 retq
400ae6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
400aed: 00 00 00
Very simple, kind of what you would expect -- two conditional branches, two calls. It gives us these (or similar) statistics on a Core 2 Duo T6570, an AMD Phenom II X4 925 and a Core i7-4770:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500
Performance counter stats for './a.out 111111':
45,216,754 branches
5,588,484 branch-misses # 12.36% of all branches
0.098535791 seconds time elapsed
If you make this change, moving the assignment before the recursive calls:
--- file.cpp.orig 2016-09-22 22:59:20.744678438 +0300
+++ file.cpp 2016-09-22 22:59:36.492583925 +0300
@@ -15,10 +15,10 @@
if(curIndex == half) return;
if(A[curIndex] == 1) return;
+ s += A[curIndex+1] + A[size-curIndex-1];
test1(curIndex+1);
test1(size - curIndex - 1);
- s += A[curIndex+1] + A[size-curIndex-1];
}
The picture changes:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500
Performance counter stats for './a.out 111111':
39,495,804 branches
54,430 branch-misses # 0.14% of all branches
0.039522259 seconds time elapsed
And yes, as was already noted, it's directly related to tail recursion optimisation, because if you compile the patched code with -fno-optimize-sibling-calls you will get the same "bad" results. So let's look at what we have in assembly with tail call optimization:
0000000000400a80 <_Z5test1i>:
400a80: 3b 3d 16 07 20 00 cmp 0x200716(%rip),%edi #
60119c <half>
400a86: 53 push %rbx
400a87: 89 fb mov %edi,%ebx
400a89: 74 5f je 400aea <_Z5test1i+0x6a>
400a8b: 48 8b 05 0e 07 20 00 mov 0x20070e(%rip),%rax #
6011a0 <A>
400a92: 48 63 d7 movslq %edi,%rdx
400a95: 83 3c 90 01 cmpl $0x1,(%rax,%rdx,4)
400a99: 74 4f je 400aea <_Z5test1i+0x6a>
400a9b: 8b 0d 07 07 20 00 mov 0x200707(%rip),%ecx #
6011a8 <size>
400aa1: eb 15 jmp 400ab8 <_Z5test1i+0x38>
400aa3: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400aa8: 48 8b 05 f1 06 20 00 mov 0x2006f1(%rip),%rax #
6011a0 <A>
400aaf: 48 63 d3 movslq %ebx,%rdx
400ab2: 83 3c 90 01 cmpl $0x1,(%rax,%rdx,4)
400ab6: 74 32 je 400aea <_Z5test1i+0x6a>
400ab8: 29 d9 sub %ebx,%ecx
400aba: 8d 7b 01 lea 0x1(%rbx),%edi
400abd: 8b 54 90 04 mov 0x4(%rax,%rdx,4),%edx
400ac1: 48 63 c9 movslq %ecx,%rcx
400ac4: 03 54 88 fc add -0x4(%rax,%rcx,4),%edx
400ac8: 01 15 ca 06 20 00 add %edx,0x2006ca(%rip) #
601198 <s>
400ace: e8 ad ff ff ff callq 400a80 <_Z5test1i>
400ad3: 8b 0d cf 06 20 00 mov 0x2006cf(%rip),%ecx #
6011a8 <size>
400ad9: 89 c8 mov %ecx,%eax
400adb: 29 d8 sub %ebx,%eax
400add: 89 c3 mov %eax,%ebx
400adf: 83 eb 01 sub $0x1,%ebx
400ae2: 39 1d b4 06 20 00 cmp %ebx,0x2006b4(%rip) #
60119c <half>
400ae8: 75 be jne 400aa8 <_Z5test1i+0x28>
400aea: 5b pop %rbx
400aeb: c3 retq
400aec: 0f 1f 40 00 nopl 0x0(%rax)
It has four conditional branches with one call. So let's analyse the data
we've got so far.
First of all, what is a branching instruction from the processor perspective? It's any of call, ret, j* (including direct jmp) and loop. call and jmp are a bit unintuitive, but they are crucial to count things correctly.
Overall, we expect this function to be called 11111100 times, once for each element; that's roughly 11M. In the non-tail-call-optimized version we see about 45M branches; the initialization in main() is just 111K and all the other things are minor, so the main contribution to this number comes from our function. Our function is call-ed, it evaluates the first je (true in all cases except one), then it evaluates the second je (true half of the time), and then it either calls itself recursively (but we've already counted that the function is invoked 11M times) or returns (as it does after the recursive calls). So that's 4 branching instructions per call times 11M calls, exactly the number we see. Out of these, around 5.5M branches are missed, which suggests that the misses all come from one mispredicted instruction: either something that's evaluated 11M times and missed around 50% of the time, or something that's evaluated half of the time and always missed.
What do we have in tail-call-optimized version? We have the function called
around 5.5M times, but now each invocation incurs one call, two branches initially (first one is true in all cases except one and the second one is always false because of our data), then a jmp, then a call (but we've already counted that we have 5.5M calls), then a branch at 400ae8 and a branch at 400ab6 (always true because of our data), then return. So, on average that's four conditional branches, one unconditional jump, a call and one indirect branch (return from function), 5.5M times 7 gives us an overall count of around 39M branches, exactly as we see in the perf output.
What we know is that the processor has no problem at all predicting things in a flow with one function call (even though this version has more conditional branches) and it has problems with two function calls. So it suggests that the problem is in returns from the function.
Unfortunately, we know very little about the details of how exactly branch
predictors of our modern processors work. The best analysis that I could find
is this and it suggests that the processors have a return stack buffer of around 16 entries. If we're to return to our data again with this finding at hand things start to clarify a bit.
When you have half-zeroes with half-ones pattern, you're recursing very
deeply into test1(curIndex+1), but then you start returning back and
calling test1(size-curIndex-1). That recursion is never deeper than one
call, so the returns are predicted perfectly for it. But remember that we're
now 55555 invocations deep and the processor only remembers last 16, so it's
not surprising that it can't guess our returns starting from 55539-level deep,
it's more surprising that it can do so with tail-call-optimized version.
Actually, the behaviour of tail-call-optimized version suggests that missing
any other information about returns, the processor just assumes that the right
one is the last one seen. It's also proven by the behaviour of
non-tail-call-optimized version, because it goes 55555 calls deep into the
test1(curIndex+1) and then upon return it always gets one level deep into
test1(size-curIndex-1), so when we're up from 55555-deep to 55539-deep (or
whatever your processor return buffer is) it calls into
test1(size-curIndex-1), returns from that and it has absolutely no
information about the next return, so it assumes that we're to return to the
last seen address (which is the address to return to from
test1(size-curIndex-1)) and it's obviously wrong. 55539 times wrong. With
100 cycles of the function, that's exactly the 5.5M branch prediction misses
we see.
Now let's get to your alternating pattern and the code for that. This code is actually very different if you analyse how deep it recurses. Here your test2(curIndex+1) always returns immediately and your test2(curIndex+2) always goes deeper. So the returns from test2(curIndex+1) are always predicted perfectly (they just don't go deep enough), and when we finish our recursion into test2(curIndex+2), it always returns to the same point, all 55555 times, so the processor has no problems with that.
This can further be proven by this little change to your original half-zeroes with half-ones code:
--- file.cpp.orig 2016-09-23 11:00:26.917977032 +0300
+++ file.cpp 2016-09-23 11:00:31.946027451 +0300
@@ -15,8 +15,8 @@
if(curIndex == half) return;
if(A[curIndex] == 1) return;
- test1(curIndex+1);
test1(size - curIndex - 1);
+ test1(curIndex+1);
s += A[curIndex+1] + A[size-curIndex-1];
So now the code generated is still not tail-call optimized (assembly-wise it's very similar to the original), but you get something like this in the perf output:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500
Performance counter stats for './a.out 111111':
45 308 579 branches
75 927 branch-misses # 0,17% of all branches
0,026271402 seconds time elapsed
As expected, now our first call always returns immediately and the second call goes 55555-deep and then only returns to the same point.
Now with that solved let me show something up my sleeve. On one system, namely a Core i5-5200U, the non-tail-call-optimized original half-zeroes with half-ones version shows these results:
$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500
Performance counter stats for './a.out 111111':
45 331 670 branches
16 349 branch-misses # 0,04% of all branches
0,043351547 seconds time elapsed
So, apparently, Broadwell can handle this pattern easily, which returns us to the question of how much we really know about the branch prediction logic of our modern processors.
Removing the line s += A[curIndex+1] + A[size-curIndex-1]; enables tail recursion optimization.
This optimization can only happen when the recursive call is the last thing the function does.
https://en.wikipedia.org/wiki/Tail_call
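Not from the original answer, but a minimal illustration of the distinction, with made-up helper names:
// Tail-recursive: the recursive call is the last thing the function does,
// so the compiler can turn it into a jump (no new stack frame and no
// extra return for the predictor to track).
int sum_tail(const int* a, int i, int n, int acc) {
    if (i == n) return acc;
    return sum_tail(a, i + 1, n, acc + a[i]);
}

// Not tail-recursive: the addition happens after the call returns, so a
// real call/return pair is needed for every element.
int sum_plain(const int* a, int i, int n) {
    if (i == n) return 0;
    return a[i] + sum_plain(a, i + 1, n);
}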
Interestingly, in the first execution you have about 30% more branches than in the second execution (32M branches vs 24M branches).
I have generated the assembly code for your application using gcc 4.8.5 and the same flags (plus -S), and there is a significant difference between the assemblies. The code with the conflicting statement is about 572 lines, while the code without that statement is only 409 lines. Focusing on the symbol _Z5test1i (the mangled C++ name for test1), the routine is 367 lines long in the first case, while in the second case it occupies only 202 lines. From all those lines, the first case contains 36 branches (plus 15 call instructions) and the second case contains 34 branches (plus 1 call instruction).
It is also interesting that compiling the application with -O1 does not expose this divergence between the two versions (although the branch mispredict rate is higher, approx. 12%). Using -O2 shows a difference between the two versions (12% vs 3% branch mispredicts).
I'm not enough of a compiler expert to understand the control flow and logic used by the compiler, but it looks like it is able to achieve smarter optimizations (maybe including tail recursion optimization, as pointed out by user1850903 in his answer) when that portion of the code is not present.
The problem is this line:
if(A[curIndex] == 1) return;
Each call of the test function alternates the result of this comparison, due to some optimizations, since the array is, for example, 0,0,0,0,0,1,1,1,1.
In other words:
curIndex = 0 -> A[0] = 0
test1(curIndex + 1) -> curIndex = 1 -> A[1] = 0
But then, the processor architecture MIGHT (a big might, because it depends; for me that optimization is disabled - an i5-6400) have a feature called runahead (performed alongside branch prediction), which executes the remaining instructions in the pipeline before entering a branch; so it will execute test1(size - curIndex - 1) before the offending if statement.
When the assignment is removed, another optimization kicks in, as user1850903 said.
The following piece of code is tail-recursive: the last line of the function doesn't require a call, simply a branch to the point where the function begins using the first argument:
void f(int i) {
    if (i == size) return;
    s += a[i];
    f(i + 1);
}
However, if we break this and make it non-tail recursive:
void f(int i) {
    if (i == size) return;
    f(i + 1);
    s += a[i];
}
There are a number of reasons why the compiler can't deduce the latter to be tail-recursive, but in the example you've given,
test(A[N]);
test(A[M]);
s += a[N] + a[M];
the same rules apply. The compiler can't determine that this is tail-recursive, and moreover it can't do the transformation because of the two recursive calls (see before and after).
What you appear to be expecting the compiler to do with this is a function which performs a couple of simple conditional branches, two calls and some loads/adds/stores.
Instead, the compiler is unrolling this recursion and generating code which has a lot of branch points. This is done partly because the compiler believes it will be more efficient this way (involving fewer branches) but partly because it decreases the runtime recursion depth.
int size;
int* A;
int half;
int s;

void test1(int curIndex){
    if(curIndex == half || A[curIndex] == 1) return;
    test1(curIndex+1);
    test1(size-curIndex-1);
    s += A[curIndex+1] + A[size-curIndex-1];
}
produces:
test1(int):
movl half(%rip), %edx
cmpl %edi, %edx
je .L36
pushq %r15
pushq %r14
movslq %edi, %rcx
pushq %r13
pushq %r12
leaq 0(,%rcx,4), %r12
pushq %rbp
pushq %rbx
subq $24, %rsp
movq A(%rip), %rax
cmpl $1, (%rax,%rcx,4)
je .L1
leal 1(%rdi), %r13d
movl %edi, %ebp
cmpl %r13d, %edx
je .L42
cmpl $1, 4(%rax,%r12)
je .L42
leal 2(%rdi), %ebx
cmpl %ebx, %edx
je .L39
cmpl $1, 8(%rax,%r12)
je .L39
leal 3(%rdi), %r14d
cmpl %r14d, %edx
je .L37
cmpl $1, 12(%rax,%r12)
je .L37
leal 4(%rdi), %edi
call test1(int)
movl %r14d, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movl %ecx, %esi
movl 16(%rax,%r12), %edx
subl %r14d, %esi
movslq %esi, %rsi
addl -4(%rax,%rsi,4), %edx
addl %edx, s(%rip)
movl half(%rip), %edx
.L10:
movl %ecx, %edi
subl %ebx, %edi
leal -1(%rdi), %r14d
cmpl %edx, %r14d
je .L38
movslq %r14d, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r15
je .L38
call test1(int)
movl %r14d, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movl %ecx, %edx
movl 4(%rax,%r15), %esi
movl %ecx, %edi
subl %r14d, %edx
subl %ebx, %edi
movslq %edx, %rdx
addl -4(%rax,%rdx,4), %esi
movl half(%rip), %edx
addl s(%rip), %esi
movl %esi, s(%rip)
.L13:
movslq %edi, %rdi
movl 12(%rax,%r12), %r8d
addl -4(%rax,%rdi,4), %r8d
addl %r8d, %esi
movl %esi, s(%rip)
.L7:
movl %ecx, %ebx
subl %r13d, %ebx
leal -1(%rbx), %r14d
cmpl %edx, %r14d
je .L41
movslq %r14d, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r15
je .L41
cmpl %edx, %ebx
je .L18
movslq %ebx, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r8
movq %r8, (%rsp)
je .L18
leal 1(%rbx), %edi
call test1(int)
movl %ebx, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movq (%rsp), %r8
movl %ecx, %esi
subl %ebx, %esi
movl 4(%rax,%r8), %edx
movslq %esi, %rsi
addl -4(%rax,%rsi,4), %edx
addl %edx, s(%rip)
movl half(%rip), %edx
.L18:
movl %ecx, %edi
subl %r14d, %edi
leal -1(%rdi), %ebx
cmpl %edx, %ebx
je .L40
movslq %ebx, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r8
je .L40
movq %r8, (%rsp)
call test1(int)
movl %ebx, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movq (%rsp), %r8
movl %ecx, %edx
movl %ecx, %edi
subl %ebx, %edx
movl 4(%rax,%r8), %esi
subl %r14d, %edi
movslq %edx, %rdx
addl -4(%rax,%rdx,4), %esi
movl half(%rip), %edx
addl s(%rip), %esi
movl %esi, %r8d
movl %esi, s(%rip)
.L20:
movslq %edi, %rdi
movl 4(%rax,%r15), %esi
movl %ecx, %ebx
addl -4(%rax,%rdi,4), %esi
subl %r13d, %ebx
addl %r8d, %esi
movl %esi, s(%rip)
.L16:
movslq %ebx, %rbx
movl 8(%rax,%r12), %edi
addl -4(%rax,%rbx,4), %edi
addl %edi, %esi
movl %esi, s(%rip)
jmp .L4
.L45:
movl s(%rip), %edx
.L23:
movslq %ebx, %rbx
movl 4(%rax,%r12), %ecx
addl -4(%rax,%rbx,4), %ecx
addl %ecx, %edx
movl %edx, s(%rip)
.L1:
addq $24, %rsp
popq %rbx
popq %rbp
popq %r12
popq %r13
popq %r14
popq %r15
.L36:
rep ret
.L42:
movl size(%rip), %ecx
.L4:
movl %ecx, %ebx
subl %ebp, %ebx
leal -1(%rbx), %r14d
cmpl %edx, %r14d
je .L45
movslq %r14d, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r15
je .L45
cmpl %edx, %ebx
je .L25
movslq %ebx, %rsi
cmpl $1, (%rax,%rsi,4)
leaq 0(,%rsi,4), %r13
je .L25
leal 1(%rbx), %esi
cmpl %edx, %esi
movl %esi, (%rsp)
je .L26
cmpl $1, 8(%rax,%r15)
je .L26
leal 2(%rbx), %edi
call test1(int)
movl (%rsp), %esi
movl %esi, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movl (%rsp), %esi
movq A(%rip), %rax
movl %ecx, %edx
subl %esi, %edx
movslq %edx, %rsi
movl 12(%rax,%r15), %edx
addl -4(%rax,%rsi,4), %edx
addl %edx, s(%rip)
movl half(%rip), %edx
.L26:
movl %ecx, %edi
subl %ebx, %edi
leal -1(%rdi), %esi
cmpl %edx, %esi
je .L43
movslq %esi, %r8
cmpl $1, (%rax,%r8,4)
leaq 0(,%r8,4), %r9
je .L43
movq %r9, 8(%rsp)
movl %esi, (%rsp)
call test1(int)
movl (%rsp), %esi
movl %esi, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movl (%rsp), %esi
movq A(%rip), %rax
movq 8(%rsp), %r9
movl %ecx, %edx
movl %ecx, %edi
subl %esi, %edx
movl 4(%rax,%r9), %esi
subl %ebx, %edi
movslq %edx, %rdx
addl -4(%rax,%rdx,4), %esi
movl half(%rip), %edx
addl s(%rip), %esi
movl %esi, s(%rip)
.L28:
movslq %edi, %rdi
movl 4(%rax,%r13), %r8d
addl -4(%rax,%rdi,4), %r8d
addl %r8d, %esi
movl %esi, s(%rip)
.L25:
movl %ecx, %r13d
subl %r14d, %r13d
leal -1(%r13), %ebx
cmpl %edx, %ebx
je .L44
movslq %ebx, %rdi
cmpl $1, (%rax,%rdi,4)
leaq 0(,%rdi,4), %rsi
movq %rsi, (%rsp)
je .L44
cmpl %edx, %r13d
je .L33
movslq %r13d, %rdx
cmpl $1, (%rax,%rdx,4)
leaq 0(,%rdx,4), %r8
movq %r8, 8(%rsp)
je .L33
leal 1(%r13), %edi
call test1(int)
movl %r13d, %edi
notl %edi
addl size(%rip), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rdi
movq 8(%rsp), %r8
movl %ecx, %edx
subl %r13d, %edx
movl 4(%rdi,%r8), %eax
movslq %edx, %rdx
addl -4(%rdi,%rdx,4), %eax
addl %eax, s(%rip)
.L33:
subl %ebx, %ecx
leal -1(%rcx), %edi
call test1(int)
movl size(%rip), %ecx
movq A(%rip), %rax
movl %ecx, %esi
movl %ecx, %r13d
subl %ebx, %esi
movq (%rsp), %rbx
subl %r14d, %r13d
movslq %esi, %rsi
movl 4(%rax,%rbx), %edx
addl -4(%rax,%rsi,4), %edx
movl s(%rip), %esi
addl %edx, %esi
movl %esi, s(%rip)
.L31:
movslq %r13d, %r13
movl 4(%rax,%r15), %edx
subl %ebp, %ecx
addl -4(%rax,%r13,4), %edx
movl %ecx, %ebx
addl %esi, %edx
movl %edx, s(%rip)
jmp .L23
.L44:
movl s(%rip), %esi
jmp .L31
.L39:
movl size(%rip), %ecx
jmp .L7
.L41:
movl s(%rip), %esi
jmp .L16
.L43:
movl s(%rip), %esi
jmp .L28
.L38:
movl s(%rip), %esi
jmp .L13
.L37:
movl size(%rip), %ecx
jmp .L10
.L40:
movl s(%rip), %r8d
jmp .L20
s:
half:
.zero 4
A:
.zero 8
size:
.zero 4
For the alternating values case, assuming size == 7:
test1(curIndex = 0)
{
if (curIndex == size - 1) return; // false x1
if (A[curIndex] == 1) return; // false x1
test1(curIndex + 1 => 1) {
if (curIndex == size - 1) return; // false x2
if (A[curIndex] == 1) return; // false x1 -mispred-> returns
}
test1(curIndex + 2 => 2) {
if (curIndex == size - 1) return; // false x 3
if (A[curIndex] == 1) return; // false x2
test1(curIndex + 1 => 3) {
if (curIndex == size - 1) return; // false x3
if (A[curIndex] == 1) return; // false x2 -mispred-> returns
}
test1(curIndex + 2 => 4) {
if (curIndex == size - 1) return; // false x4
if (A[curIndex] == 1) return; // false x3
test1(curIndex + 1 => 5) {
if (curIndex == size - 1) return; // false x5
if (A[curIndex] == 1) return; // false x3 -mispred-> returns
}
test1(curIndex + 2 => 6) {
if (curIndex == size - 1) return; // false x5 -mispred-> returns
}
s += A[5] + A[6];
}
s += A[3] + A[4];
}
s += A[1] + A[2];
}
And let's imagine a case where
size = 11;
A[11] = { 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0 };
test1(0)
-> test1(1)
-> test1(2)
-> test1(3) -> returns because 1
-> test1(4)
-> test1(5)
-> test1(6)
-> test1(7) -- returns because 1
-> test1(8)
-> test1(9) -- returns because 1
-> test1(10) -- returns because size-1
-> test1(7) -- returns because 1
-> test1(6)
-> test1(7)
-> test1(8)
-> test1(9) -- 1
-> test1(10) -- size-1
-> test1(3) -> returns
-> test1(2)
... as above
or
size = 5;
A[5] = { 0, 0, 0, 0, 1 };
test1(0)
-> test1(1)
-> test1(2)
-> test1(3)
-> test1(4) -- size
-> test1(5) -- UB
-> test1(4)
-> test1(3)
-> test1(4) -- size
-> test1(5) -- UB
-> test1(2)
..
The two cases you've singled out (alternating and half-pattern) are optimal extremes and the compiler has picked some intermediate case that it will try to handle best.

If Else construct

if (a == 1)
//do something
else if (a == 2)
//do something
else if (a == 3)
//do something
else if (a == 4)
//do something
else if (a == 5)
//do something
else if (a == 6)
//do something
else if (a == 7)
//do something
else if (a == 8)
//do something
Now imagine we know that a will mostly be 7 and that we execute this block of code many times in a program. Will moving the (a == 7) check to the top improve performance? That is:
if (a == 7)
//do something
else if (a == 1)
//do something
else if (a == 2)
//do something
else if (a == 3)
//do something
and so on. Does it improve anything, or is it just wishful thinking?
You can use a switch statement to improve the performance of the program:
switch (a)
{
case 1:
break;
case 2:
break;
case 3:
break;
case 4:
break;
case 5:
break;
case 6:
break;
case 7:
break;
}
Since the if conditions are checked in the order specified, yes. Whether it is measurable (and, hence, should you care) will depend on how many times that portion of the code is called.
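Not from the original answer; a quick back-of-the-envelope check of how much the ordering can matter, under the made-up assumption that a is 7 half the time and the other seven values share the remaining probability evenly:
#include <iostream>

int main() {
    double p7 = 0.5, pOther = 0.5 / 7.0;
    double originalOrder = 0, sevenFirst = 0;
    for (int v = 1; v <= 8; ++v) {
        double p = (v == 7) ? p7 : pOther;
        // Original order: value v is reached after v comparisons.
        originalOrder += p * v;
        // With a == 7 checked first: 7 costs 1, 1..6 shift down one slot,
        // 8 stays last.
        sevenFirst += p * (v == 7 ? 1 : (v < 7 ? v + 1 : v));
    }
    std::cout << "expected comparisons, original order:  " << originalOrder << "\n"
              << "expected comparisons, 7 checked first: " << sevenFirst << "\n";
    return 0;
}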
Imagine you go to a hotel and are given room number 7.
You have to walk down the hall, checking every room, until you find room number 7.
Will the time taken depend on how many rooms you checked before reaching the one you were allotted?
Yes.
But know this: in your scenario the time difference will be far too small to notice.
For scenarios where there are many values to check, putting the one that occurs most often at the beginning does improve performance. In fact, this approach is used by some network protocols when comparing protocol numbers.
There is some penalty to be paid when the compiler can't turn the construct into a jump table. If the switch/case is compiled as a jump table in assembly and the if/else chain is not, then switch/case has an edge over if/else. Again, I guess this depends on the architecture and the compiler.
In the case of switch/case, the compiler can generate an assembly jump table only for suitable constants (e.g. consecutive values).
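Not from the original answer; a small illustration of that point (whether a given compiler actually emits a table is compiler- and target-dependent):
// Dense, consecutive case values: compilers typically emit an indexed
// jump table, so the cost doesn't depend on which case is hit.
int dense(int a) {
    switch (a) {
        case 1: return 10;
        case 2: return 20;
        case 3: return 30;
        case 4: return 40;
        default: return 0;
    }
}

// Sparse case values: a table would be mostly holes, so the compiler
// usually falls back to a compare-and-branch tree instead.
int sparse(int a) {
    switch (a) {
        case 1: return 10;
        case 1000: return 20;
        case 50000: return 30;
        default: return 0;
    }
}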
The test I ran on my machine gave the following assembly for if/else (not a jump table):
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $7, -4(%rbp)
cmpl $1, -4(%rbp)
jne .L2
movl $97, %edi
call putchar
jmp .L3
.L2:
cmpl $2, -4(%rbp)
jne .L4
movl $97, %edi
call putchar
jmp .L3
.L4:
cmpl $3, -4(%rbp)
jne .L5
movl $97, %edi
call putchar
jmp .L3
.L5:
cmpl $4, -4(%rbp)
jne .L6
movl $97, %edi
call putchar
jmp .L3
.L6:
cmpl $5, -4(%rbp)
jne .L7
movl $97, %edi
call putchar
jmp .L3
.L7:
cmpl $6, -4(%rbp)
jne .L8
movl $97, %edi
call putchar
jmp .L3
.L8:
cmpl $7, -4(%rbp)
jne .L9
movl $97, %edi
call putchar
jmp .L3
.L9:
cmpl $8, -4(%rbp)
jne .L3
movl $97, %edi
call putchar
.L3:
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
But for switch/case (jump table),
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $7, -4(%rbp)
cmpl $7, -4(%rbp)
ja .L2
movl -4(%rbp), %eax
movq .L4(,%rax,8), %rax
jmp *%rax
.section .rodata
.align 8
.align 4
.L4:
.quad .L2
.quad .L3
.quad .L5
.quad .L6
.quad .L7
.quad .L8
.quad .L9
.quad .L10
.text
.L3:
movl $97, %edi
call putchar
jmp .L2
.L5:
movl $97, %edi
call putchar
jmp .L2
.L6:
movl $97, %edi
call putchar
jmp .L2
.L7:
movl $97, %edi
call putchar
jmp .L2
.L8:
movl $97, %edi
call putchar
jmp .L2
.L9:
movl $97, %edi
call putchar
jmp .L2
.L10:
movl $97, %edi
call putchar
nop
.L2:
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
From the tests, I feel that switch/case is better, as it doesn't have to go through the earlier entries to find a match.
I would suggest trying the gcc -S option to generate the assembly and checking the asm yourself.
TL;DR version
For so few values, any differences in speed will be immeasurably small, and you'd be better off sticking with the more straightforward, easier-to-understand version. It isn't until you need to start searching through tables containing thousands to millions of entries that you'll want something smarter than a linear ordered search.
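Not part of the original answer; for the large-table case mentioned above, a sorted table plus std::lower_bound is the usual first step up from a linear scan. The names and layout here are made up for illustration:
#include <algorithm>
#include <cstddef>

struct Entry { int key; void (*handler)(); };

// Binary search over a sorted table: O(log n) comparisons instead of
// walking every entry.
void dispatch(const Entry* table, std::size_t n, int key) {
    const Entry* end = table + n;
    const Entry* it = std::lower_bound(
        table, end, key,
        [](const Entry& e, int k) { return e.key < k; });
    if (it != end && it->key == key)
        it->handler();
}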
James Michener Version
Another possibility not yet mentioned is to do a partitioned search, like so:
if ( a > 4 )
{
if ( a > 6 )
{
if ( a == 7 ) // do stuff
else // a == 8, do stuff
}
else
{
if ( a == 5 ) // do stuff
else // a == 6, do stuff
}
}
else
{
if ( a > 2 )
{
if ( a == 3 ) // do stuff
else // a == 4, do stuff
}
else
{
if ( a == 1 ) // do stuff
else // a == 2, do stuff
}
}
No more than three tests are performed for any value of a. Of course, no less than three tests are performed for any value of a, either. On average, it should give better performance than the naive 1-8 search when the majority of inputs are 7, but...
As with all things performance-related, the rule is measure, don't guess. Code up different versions, profile them, analyze the results. For testing against so few values, it's going to be hard to get reliable numbers; you'll need to execute each method thousands of times for a given value just to get a useful non-zero time measurement (it also means that any difference between the methods will be ridiculously small).
Stuff like this can also be affected by compiler optimization settings. You'll want to build at different optimization levels and re-run your tests.
Just for giggles, I coded up my own version measuring several different approaches:
naive - the straightforward test from 1 to 8 in order;
sevenfirst - check for 7 first, then 1 - 6 and 8;
eightfirst - check from 8 to 1 in reverse order;
partitioned - use the partitioned search above;
switcher - use a switch statement instead of if-else;
I used the following test harness:
int main( void )
{
size_t counter[9] = {0};
struct timeval start, end;
unsigned long total_nsec;
void (*funcs[])(int, size_t *) = { naive, sevenfirst, eightfirst, partitioned, switcher };
srand(time(NULL));
printf("%15s %15s %15s %15s %15s %15s\n", "test #", "naive", "sevenfirst", "eightfirst", "partitioned", "switcher" );
printf("%15s %15s %15s %15s %15s %15s\n", "------", "-----", "----------", "----------", "-----------", "--------" );
unsigned long times[5] = {0};
for ( size_t t = 0; t < 20; t++ )
{
printf( "%15zu ", t );
for ( size_t f = 0; f < 5; f ++ )
{
total_nsec = 0;
for ( size_t i = 0; i < 1000; i++ )
{
int a = generate();
gettimeofday( &start, NULL );
for ( size_t j = 0; j < 10000; j++ )
(*funcs[f])( a, counter );
gettimeofday( &end, NULL );
}
total_nsec += end.tv_usec - start.tv_usec;
printf( "%15lu ", total_nsec );
times[f] += total_nsec;
memset( counter, 0, sizeof counter );
}
putchar('\n');
}
putchar ('\n');
printf( "%15s ", "average:" );
for ( size_t i = 0; i < 5; i++ )
printf( "%15f ", (double) times[i] / 20 );
putchar ('\n' );
return 0;
}
The generate function produces random numbers from 1 through 8, weighted so that 7 appears half the time. I run each method 10000 times per generated value to get measurable times, for 1000 generated values.
I didn't want the performance difference between the various control structures to get swamped by the // do stuff code, so each case just increments a counter, such as
if ( a == 1 )
counter[1]++;
This also gave me a way to verify that my number generator was working properly.
I run through the whole sequence 20 times and average the results. Even so, the numbers can vary a bit from run to run, so don't trust them too deeply. If nothing else, they show that changes at this level don't result in huge improvements. For example:
test # naive sevenfirst eightfirst partitioned switcher
------ ----- ---------- ---------- ----------- --------
0 121 100 118 119 111
1 110 100 131 120 115
2 110 100 125 121 111
3 115 125 117 105 110
4 120 116 125 110 115
5 129 100 110 106 116
6 115 176 105 106 115
7 111 100 111 106 110
8 139 100 106 111 116
9 125 100 136 106 111
10 106 100 105 106 111
11 126 112 135 105 115
12 116 120 135 110 115
13 120 105 106 111 115
14 120 105 105 106 110
15 100 131 106 118 115
16 106 113 116 111 110
17 106 105 105 118 111
18 121 113 103 106 115
19 130 101 105 105 116
average: 117.300000 111.100000 115.250000 110.300000 113.150000
Numbers are in microseconds. The code was built using gcc 4.1.2 with no optimization, running on a SLES 10 system.¹
So, running each method 10000 times for 1000 values, averaged over 20 runs, gives a total variation of about 7 μsec. That's really not worth getting exercised over. For something that's only searching among 8 distinct values and isn't going to run more than "several times", you're not going to see any measurable improvement in performance regardless of the method used. Stick with the method that's the easiest to read and understand.
Now, for searching a table containing several hundreds to thousands to millions of entries, you definitely want to use something smarter than a linear search.
1. It should go without saying, but the results above are only valid for this code built with this specific compiler and running on this specific system.
There should be a slight difference, but this would depend on the platform and the nature of the comparison - different compilers may optimise something like this differently, different architectures will have different effects as well, and it also depends on what the comparison actually is (if it is a more complex comparison than a simple primitive type comparison, for example)
It is probably good practice to test the specific case you are actually going to use, IF this is something that is likely to actually be a performance bottleneck.
Alternatively, a switch statement, if usable, should have the same performance for any value independent of order because it is implemented using an offset in memory as opposed to successive comparisons.
Probably not: you still have the same number of conditions, and any of them might evaluate to true; even if you check for a == 7 first, for other values of a the preceding conditions still have to be evaluated.
The block of code that is executed when a == 7 might be reached more quickly when the program runs - but essentially your code is still the same, with the same number of statements.

Clang giving very different performance for expressions which intuitively should be equivalent

Could someone explain to me the considerable performance differences between these expressions, which I would expect to give similar performance? I'm compiling with Apple LLVM version 5.1 (clang-503.0.38) (based on LLVM 3.4svn) in release mode.
Here's my test code (just change CASE to 1, 2, 3 or 4 to test yourself):
#include <iostream>
#include <chrono>
#define CASE 1
inline int foo(int n) {
return
#if CASE == 1
(n % 2) ? 9 : 6
#elif CASE == 2
(n % 2) == true ? 9 : 6
#elif CASE == 3
6 + (n % 2) * 3
#elif CASE == 4
6 + bool(n % 2) * 3
#endif
;
}
int main(int argc, const char* argv[])
{
std::chrono::time_point<std::chrono::system_clock> start, end;
start = std::chrono::system_clock::now();
int n = argc;
for (int i = 0; i < 100000000; ++i) {
n += foo(n);
}
end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "elapsed time: " << elapsed_seconds.count() << "\n";
std::cout << "value: " << n << "\n";
return 0;
}
And here are the timings I get:
CASE EXPRESSION TIME
1 (n % 2) ? 9 : 6 0.1585
2 (n % 2) == true ? 9 : 6 0.3491
3 6 + (n % 2) * 3 0.2559
4 6 + bool(n % 2) * 3 0.1906
Here's the difference in assembly between CASE 1 and CASE 2:
CASE 1:
Ltmp12:
LBB0_1: ## =>This Inner Loop Header: Depth=1
##DEBUG_VALUE: main:argv <- RSI
##DEBUG_VALUE: i <- 0
.loc 1 24 0 ## /Test/main.cpp:24:0
movl %ebx, %ecx
andl $1, %ecx
leal (%rcx,%rcx,2), %ecx
Ltmp13:
.loc 1 48 14 ## /Test/main.cpp:48:14
leal 6(%rbx,%rcx), %ebx
CASE 2:
Ltmp12:
LBB0_1: ## =>This Inner Loop Header: Depth=1
##DEBUG_VALUE: main:argv <- RSI
##DEBUG_VALUE: i <- 0
.loc 1 24 0 ## /Test/main.cpp:24:0
movl %ebx, %ecx
shrl $31, %ecx
addl %ebx, %ecx
andl $-2, %ecx
movl %ebx, %edx
subl %ecx, %edx
cmpl $1, %edx
sete %cl
movzbl %cl, %ecx
leal (%rcx,%rcx,2), %ecx
Ltmp13:
.loc 1 48 14 ## /Test/main.cpp:48:14
leal 6(%rbx,%rcx), %ebx
And here's the difference in assembly between CASE 3 and CASE 4:
CASE 3:
Ltmp12:
LBB0_1: ## =>This Inner Loop Header: Depth=1
##DEBUG_VALUE: main:argv <- RSI
##DEBUG_VALUE: i <- 0
.loc 1 24 0 ## /Test/main.cpp:24:0
movl %ebx, %ecx
shrl $31, %ecx
addl %ebx, %ecx
andl $-2, %ecx
movl %ebx, %edx
subl %ecx, %edx
leal (%rdx,%rdx,2), %ecx
Ltmp13:
.loc 1 48 14 ## /Test/main.cpp:48:14
leal 6(%rbx,%rcx), %ebx
CASE 4:
Ltmp12:
LBB0_1: ## =>This Inner Loop Header: Depth=1
##DEBUG_VALUE: main:argv <- RSI
##DEBUG_VALUE: i <- 0
.loc 1 24 0 ## /Test/main.cpp:24:0
movl %ebx, %ecx
andl $1, %ecx
negl %ecx
andl $3, %ecx
Ltmp13:
.loc 1 48 14 ## /Test/main.cpp:48:14
leal 6(%rbx,%rcx), %ebx
This answer currently only covers the difference between the first two cases.
What are the possible values of (n % 2)? Surely, it's 0 and 1, right?
Wrong. It's 0, 1 and -1. Because n is a signed integer, and the result of % can be negative.
(n % 2) ? 6 : 9 implicitly converts the expression n % 2 to bool. The result of this conversion is true iff the value is nonzero. The conversion is therefore equivalent to (n % 2) != 0.
In (n % 2) == true ? 6 : 9, for the comparison (n % 2) == true, the usual arithmetic conversions are applied to both sides (note that bool is an arithmetic type). true is promoted to an int of value 1. So the comparison is equivalent to (n % 2) == 1.
The two conversions (n % 2) != 0 and (n % 2) == 1 yield different results for a negative n: Let n = -1. Then n % 2 == -1, and -1 != 0 is true, but -1 == 1 is false.
Therefore, the compiler has to introduce some additional complexity to cope with the sign.
The difference in runtime disappears if you make n an unsigned integer, or remove the sign issue any other way (e.g. by comparing n % 2 != false).
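Not from the original answer; a minimal sketch of the sign-free variants it describes (same shape as foo from the question):
// Variant A: make n unsigned, so n % 2 can only be 0 or 1 and
// comparing against true is safe.
inline int foo_unsigned(unsigned n) {
    return (n % 2) == true ? 9 : 6;
}

// Variant B: keep n signed but compare against false; this is the same
// test as the plain (n % 2) ? 9 : 6 form, i.e. (n % 2) != 0.
inline int foo_vs_false(int n) {
    return (n % 2) != false ? 9 : 6;
}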
I got this idea by looking at the assembly output, especially the following line:
shrl $31, %eax
Using the highest bit made no sense to me first, until I realized the highest bit is used as the sign.

why comparing char variables twice is faster than comparing short variables once

I thought one compare must be faster than two. But after my test, I found that in debug mode the short compare is a bit faster, and in release mode the char compare is faster. I want to know the real reason.
Following are the test code and test results. I wrote two simple functions, func1() using two char compares and func2() using one short compare. The main function returns a temporary return value so that compiler optimization cannot discard my test code. My compiler is GCC 4.7.2, CPU Intel® Xeon® CPU E5-2430 0 @ 2.20GHz (VM).
inline int func1(unsigned char word[2])
{
    if (word[0] == 0xff && word[1] == 0xff)
        return 1;
    return 0;
}

inline int func2(unsigned char word[2])
{
    if (*(unsigned short*)word == 0xffff)
        return 1;
    return 0;
}

int main()
{
    int n_ret = 0;
    for (int j = 0; j < 10000; ++j)
        for (int i = 0; i < 70000; ++i)
            n_ret += func2((unsigned char*)&i);
    return n_ret;
}
Debug mode:
func1 func2
real 0m3.621s 0m3.586s
user 0m3.614s 0m3.579s
sys 0m0.001s 0m0.000s
Release mode:
func1 func2
real 0m0.833s 0m0.880s
user 0m0.831s 0m0.878s
sys 0m0.000s 0m0.002s
func1 edition's assembly code:
.cfi_startproc
movl $10000, %esi
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L6:
movl $1, %edx
xorl %ecx, %ecx
.p2align 4,,10
.p2align 3
.L8:
movl %edx, -24(%rsp)
addl $1, %edx
addl %ecx, %eax
cmpl $70001, %edx
je .L3
xorl %ecx, %ecx
cmpb $-1, -24(%rsp)
jne .L8
xorl %ecx, %ecx
cmpb $-1, -23(%rsp)
sete %cl
jmp .L8
.p2align 4,,10
.p2align 3
.L3:
subl $1, %esi
jne .L6
rep
ret
.cfi_endproc
func2 edition's assembly code:
.cfi_startproc
movl $10000, %esi
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L4:
movl $1, %edx
xorl %ecx, %ecx
jmp .L3
.p2align 4,,10
.p2align 3
.L7:
movzwl -24(%rsp), %ecx
.L3:
cmpw $-1, %cx
movl %edx, -24(%rsp)
sete %cl
addl $1, %edx
movzbl %cl, %ecx
addl %ecx, %eax
cmpl $70001, %edx
jne .L7
subl $1, %esi
jne .L4
rep
ret
.cfi_endproc
With GCC 4.6.3 the generated code is different for the first and second pieces of code, and the runtime for the func1 option is noticeably slower if you run it for long enough. Unfortunately, with your very short runtime, the two appear similar in time.
Increasing the outer loop by a factor of 10 means it takes about 6 seconds for func2 and 10 seconds for func1. This is using gcc -std=c99 -O3 to compile the code.
The main difference, I expect, is from the extra branch introduced by the && expression. And the extra xorl %ecx, %ecx doesn't help much (I get the same, although my code looks subtly different when it comes to label names).
Edit: I did try to come up with a branchless solution using and instead of a branch, but the compiler refuses to inline the function, so it takes 30 seconds instead of 10.
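For illustration, one way such a branchless variant might look (this is an assumption about what was tried, not the code from the actual test):
inline int func3(unsigned char word[2])
{
    // Bitwise & evaluates both comparisons unconditionally, so there is
    // no second conditional branch for the predictor to miss.
    return (word[0] == 0xff) & (word[1] == 0xff);
}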
Benchmarks run on an AMD Phenom(tm) II X4 965 at 3.4 GHz.