I want to optimize my code for vectorization using
-msse2 -ftree-vectorizer-verbose=2.
I have the following simple code:
int main() {
    int a[2048], b[2048], c[2048];
    int i;
    for (i = 0; i < 2048; i++) {
        b[i] = 0;
        c[i] = 0;
    }
    for (i = 0; i < 2048; i++) {
        a[i] = b[i] + c[i];
    }
    return 0;
}
Why do I get the following note?
test.cpp:10: note: not vectorized: not enough data-refs in basic block.
Thanks!
For what it's worth, after adding an asm volatile("": "+m"(a), "+m"(b), "+m"(c)::"memory"); near the end of main, my copy of gcc emits this:
400610: 48 81 ec 08 60 00 00 sub $0x6008,%rsp
400617: ba 00 20 00 00 mov $0x2000,%edx
40061c: 31 f6 xor %esi,%esi
40061e: 48 8d bc 24 00 20 00 lea 0x2000(%rsp),%rdi
400625: 00
400626: e8 b5 ff ff ff callq 4005e0 <memset@plt>
40062b: ba 00 20 00 00 mov $0x2000,%edx
400630: 31 f6 xor %esi,%esi
400632: 48 8d bc 24 00 40 00 lea 0x4000(%rsp),%rdi
400639: 00
40063a: e8 a1 ff ff ff callq 4005e0 <memset@plt>
40063f: 31 c0 xor %eax,%eax
400641: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400648: c5 f9 6f 84 04 00 20 vmovdqa 0x2000(%rsp,%rax,1),%xmm0
40064f: 00 00
400651: c5 f9 fe 84 04 00 40 vpaddd 0x4000(%rsp,%rax,1),%xmm0,%xmm0
400658: 00 00
40065a: c5 f8 29 04 04 vmovaps %xmm0,(%rsp,%rax,1)
40065f: 48 83 c0 10 add $0x10,%rax
400663: 48 3d 00 20 00 00 cmp $0x2000,%rax
400669: 75 dd jne 400648 <main+0x38>
So it recognised that the first loop was just doing memset on a couple of arrays, and that the second loop was doing a vector addition, which it appropriately vectorised.
I'm using gcc version 4.9.0 20140521 (prerelease) (GCC).
An older machine with gcc version 4.7.2 (Debian 4.7.2-5) also vectorises the loop, but in a different way. Your -ftree-vectorizer-verbose=2 setting makes it emit the following output:
Analyzing loop at foo155.cc:10
Vectorizing loop at foo155.cc:10
10: LOOP VECTORIZED.
foo155.cc:1: note: vectorized 1 loops in function.
You probably goofed your compiler flags (I used g++ -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 -march=native foo155.cc -o foo155 to build) or have a really old compiler.
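Putting it together, the test file looks roughly like this (reconstructed from the snippets above; the asm statement is the one mentioned earlier, and it keeps gcc from throwing the arrays away entirely at -O3):
int main() {
    int a[2048], b[2048], c[2048];
    int i;
    // Zero the inputs (gcc turns this pair of loops into two memset calls).
    for (i = 0; i < 2048; i++) {
        b[i] = 0;
        c[i] = 0;
    }
    // The loop that should get vectorized.
    for (i = 0; i < 2048; i++) {
        a[i] = b[i] + c[i];
    }
    // Force the arrays to be considered "used" so the loops survive optimization.
    asm volatile("" : "+m"(a), "+m"(b), "+m"(c) :: "memory");
    return 0;
}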
Remove the first loop and initialize the arrays at declaration instead (note that = {0} has to go on each array you want zeroed):
int a[2048], b[2048] = {0}, c[2048] = {0};
Also try the flag
-ftree-vectorize
instead of
-msse2 -ftree-vectorizer-verbose=2
I don't understand how std::memory_order_XXX (memory_order_release, memory_order_acquire, ...) works.
Some documents say these memory orders have different semantics, but I'm really confused because they produce the same assembly code. What determines the difference?
This code:
#include <atomic>

static std::atomic<long> gt;

void test1() {
    gt.store(1, std::memory_order_release);
    gt.store(2, std::memory_order_relaxed);
    gt.load(std::memory_order_acquire);
    gt.load(std::memory_order_relaxed);
}
Corresponds to:
00000000000007a0 <_Z5test1v>:
7a0: 55 push %rbp
7a1: 48 89 e5 mov %rsp,%rbp
7a4: 48 83 ec 30 sub $0x30,%rsp
memory_order_release:
7a8: 48 c7 45 f8 01 00 00 movq $0x1,-0x8(%rbp)
7af: 00
7b0: c7 45 e8 03 00 00 00 movl $0x3,-0x18(%rbp)
7b7: 8b 45 e8 mov -0x18(%rbp),%eax
7ba: be ff ff 00 00 mov $0xffff,%esi
7bf: 89 c7 mov %eax,%edi
7c1: e8 b1 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
7c6: 89 45 ec mov %eax,-0x14(%rbp)
7c9: 48 8b 55 f8 mov -0x8(%rbp),%rdx
7cd: 48 8d 05 44 08 20 00 lea 0x200844(%rip),%rax # 201018 <_ZL2gt>
7d4: 48 89 10 mov %rdx,(%rax)
7d7: 0f ae f0 mfence
memory_order_relaxed:
7da: 48 c7 45 f0 02 00 00 movq $0x2,-0x10(%rbp)
7e1: 00
7e2: c7 45 e0 00 00 00 00 movl $0x0,-0x20(%rbp)
7e9: 8b 45 e0 mov -0x20(%rbp),%eax
7ec: be ff ff 00 00 mov $0xffff,%esi
7f1: 89 c7 mov %eax,%edi
7f3: e8 7f 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
7f8: 89 45 e4 mov %eax,-0x1c(%rbp)
7fb: 48 8b 55 f0 mov -0x10(%rbp),%rdx
7ff: 48 8d 05 12 08 20 00 lea 0x200812(%rip),%rax # 201018 <_ZL2gt>
806: 48 89 10 mov %rdx,(%rax)
809: 0f ae f0 mfence
memory_order_acquire:
80c: c7 45 d8 02 00 00 00 movl $0x2,-0x28(%rbp)
813: 8b 45 d8 mov -0x28(%rbp),%eax
816: be ff ff 00 00 mov $0xffff,%esi
81b: 89 c7 mov %eax,%edi
81d: e8 55 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
822: 89 45 dc mov %eax,-0x24(%rbp)
825: 48 8d 05 ec 07 20 00 lea 0x2007ec(%rip),%rax # 201018 <_ZL2gt>
82c: 48 8b 00 mov (%rax),%rax
memory_order_relaxed:
82f: c7 45 d0 00 00 00 00 movl $0x0,-0x30(%rbp)
836: 8b 45 d0 mov -0x30(%rbp),%eax
839: be ff ff 00 00 mov $0xffff,%esi
83e: 89 c7 mov %eax,%edi
840: e8 32 00 00 00 callq 877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
845: 89 45 d4 mov %eax,-0x2c(%rbp)
848: 48 8d 05 c9 07 20 00 lea 0x2007c9(%rip),%rax # 201018 <_ZL2gt>
84f: 48 8b 00 mov (%rax),%rax
852: 90 nop
853: c9 leaveq
854: c3 retq
00000000000008cc <_ZStanSt12memory_orderSt23__memory_order_modifier>:
8cc: 55 push %rbp
8cd: 48 89 e5 mov %rsp,%rbp
8d0: 89 7d fc mov %edi,-0x4(%rbp)
8d3: 89 75 f8 mov %esi,-0x8(%rbp)
8d6: 8b 55 fc mov -0x4(%rbp),%edx
8d9: 8b 45 f8 mov -0x8(%rbp),%eax
8dc: 21 d0 and %edx,%eax
8de: 5d pop %rbp
8df: c3 retq
I expected different memory orders to produce different assembly code,
but choosing a different order has no effect on the assembly. Can anyone explain this?
Each memory order setting has its own semantics, which the compiler is obliged to satisfy. In particular:
It forbids the compiler from performing certain optimizations, such as reordering reads and writes.
It instructs the compiler to propagate the same requirement down to the hardware. How that is done depends on the platform. x86_64 itself provides a very strong memory model, so in almost all cases you will see no difference in the generated x86_64 assembly no matter which memory order you choose. On RISC architectures (e.g. ARM), however, you will see a difference, because the compiler has to insert memory barriers; which barrier it uses depends on the selected memory order.
EDIT: Have a look at JSR-133. It is quite old and is about Java, but it provides the best explanation of memory models from the compiler's perspective that I know of. In particular, look at its table of memory-barrier instructions for different architectures.
Given the code:
#include <atomic>
static std::atomic<long> gt;
void test1() {
    gt.store(41, std::memory_order_release);
    gt.store(42, std::memory_order_relaxed);
    gt.load(std::memory_order_acquire);
    gt.load(std::memory_order_relaxed);
}
At a decent optimization level there is no garbage assembly shuffling values between registers and the stack:
test1():
movq $41, gt(%rip)
movq $42, gt(%rip)
movq gt(%rip), %rax
movq gt(%rip), %rax
ret
We see that the exact same code is generated for the different memory orders. (Testing different statements in sequence in the same function is bad practice in general, as C++ statements don't have to be compiled independently and the surrounding context can influence code generation; but with GCC's current code generation, each statement involving an atomic is compiled on its own. Good practice is to have a different function for each statement.)
The same code is generated here because no special instruction happens to be needed for these memory orders.
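Following that advice, a minimal sketch (the function names are mine) that lets each memory order be inspected in isolation:
#include <atomic>

static std::atomic<long> gt;

// One statement per function, so each ordering gets its own assembly listing.
void store_relaxed() { gt.store(1, std::memory_order_relaxed); }
void store_release() { gt.store(2, std::memory_order_release); }
void store_seq_cst() { gt.store(3, std::memory_order_seq_cst); }
long load_relaxed()  { return gt.load(std::memory_order_relaxed); }
long load_acquire()  { return gt.load(std::memory_order_acquire); }
On x86-64 the seq_cst store is where a difference does typically show up even in a tiny example like this: it usually compiles to an xchg (or a mov followed by mfence), while the relaxed and release stores, and all of the loads, are plain movs.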
I'm writing AVX code to multiply a vector of 268 million (2^28) float components by a constant, but I see no difference between my small -- I hope -- optimized AVX code and the long, compiler-optimized scalar version.
Both versions run in about 400-410 ms.
Can someone tell me why this happens?
And why does the much larger assembly generated by the compiler take almost the same time?
It's an important question, because if small computations -- like this multiplication -- see no improvement, then there is no point in writing the manual code on an Intel Core CPU. Perhaps it would pay off on an Intel Xeon (with 16 cores) or for more complex computations.
I'm compiling with G++ with parameters:
g++ -O3 -mtune=native -march=native -mavx -g3 -Wall -c -fmessage-length=0 -MMD -MP -MF"src/Test AVX.d" -MT"src/Test\ AVX.d" -o "src/Test AVX.o" "../src/Test AVX.cpp"
My CPU is an Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz.
Here is the AVX code:
/**
 * Run AVX Code
 */
void AVX() {
    // Loop control
    uint_fast32_t loop = 0;
    // The constant
    __m256 _const = _mm256_set1_ps(5.0f);
    // The register for multiplication
    __m256 _ymm0 = _mm256_setzero_ps();
    // A "buffer" between the vector and the YMM0 register
    float f_data[8];
    // The main loop
    for ( loop = 0 ; loop < SIZE ; loop = loop + 8 ) {
        // Load to buffer
        f_data[0] = vector[loop];
        f_data[1] = vector[loop+1];
        f_data[2] = vector[loop+2];
        f_data[3] = vector[loop+3];
        f_data[4] = vector[loop+4];
        f_data[5] = vector[loop+5];
        f_data[6] = vector[loop+6];
        f_data[7] = vector[loop+7];
        /*
         * I tried to use pointers instead of copying
         * the data, but the program crashed
         *
         * float **f_data;
         * f_data = float*[8];
         *
         * f_data[0] = &vector[loop];
         * ...
         *
         */
        // Load to XMM and YMM Registers
        _ymm0 = _mm256_load_ps(f_data);
        // Do the multiplication
        _ymm0 = _mm256_mul_ps(_ymm0, _const);
        // Copy the results from the register to the "buffer"
        _mm256_store_ps(f_data, _ymm0);
        // Copy from the "buffer" to the vector
        vector[loop]   = f_data[0];
        vector[loop+1] = f_data[1];
        vector[loop+2] = f_data[2];
        vector[loop+3] = f_data[3];
        vector[loop+4] = f_data[4];
        vector[loop+5] = f_data[5];
        vector[loop+6] = f_data[6];
        vector[loop+7] = f_data[7];
    }
}
The AVX code, disassembled:
0000000000400de0 <_Z3AVXv>:
400de0: 48 8b 05 b1 13 20 00 mov rax,QWORD PTR [rip+0x2013b1] # 602198 <vector>
400de7: c5 fc 28 0d 71 06 00 vmovaps ymm1,YMMWORD PTR [rip+0x671] # 401460 <_IO_stdin_used+0x40>
400dee: 00
400def: 48 8d 90 00 00 00 40 lea rdx,[rax+0x40000000]
400df6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
400dfd: 00 00 00
400e00: c5 f4 59 00 vmulps ymm0,ymm1,YMMWORD PTR [rax]
400e04: 48 83 c0 20 add rax,0x20
400e08: c5 fc 11 40 e0 vmovups YMMWORD PTR [rax-0x20],ymm0
400e0d: 48 39 c2 cmp rdx,rax
400e10: 75 ee jne 400e00 <_Z3AVXv+0x20>
400e12: c5 f8 77 vzeroupper
400e15: c3 ret
400e16: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
400e1d: 00 00 00
The Serial Version:
/**
 * Run Compiler optimized version
 */
void Serial() {
    uint_fast32_t loop;
    // Do the multiplication
    for ( loop = 0 ; loop < SIZE ; loop++ )
        vector[loop] *= 5;
}
The serial version, disassembled. It's much larger and moves the data around more, yet takes almost the same time. How is that possible?
0000000000400e80 <_Z6Serialv>:
400e80: 48 8b 35 11 13 20 00 mov rsi,QWORD PTR [rip+0x201311] # 602198 <vector>
400e87: 48 89 f0 mov rax,rsi
400e8a: 48 c1 e8 02 shr rax,0x2
400e8e: 48 f7 d8 neg rax
400e91: 83 e0 07 and eax,0x7
400e94: 0f 84 96 01 00 00 je 401030 <_Z6Serialv+0x1b0>
400e9a: c5 fa 10 05 7a 04 00 vmovss xmm0,DWORD PTR [rip+0x47a] # 40131c <_IO_stdin_used+0x1c>
400ea1: 00
400ea2: c5 fa 59 0e vmulss xmm1,xmm0,DWORD PTR [rsi]
400ea6: c5 fa 11 0e vmovss DWORD PTR [rsi],xmm1
400eaa: 48 83 f8 01 cmp rax,0x1
400eae: 0f 84 8c 01 00 00 je 401040 <_Z6Serialv+0x1c0>
400eb4: c5 fa 59 4e 04 vmulss xmm1,xmm0,DWORD PTR [rsi+0x4]
400eb9: c5 fa 11 4e 04 vmovss DWORD PTR [rsi+0x4],xmm1
400ebe: 48 83 f8 02 cmp rax,0x2
400ec2: 0f 84 89 01 00 00 je 401051 <_Z6Serialv+0x1d1>
400ec8: c5 fa 59 4e 08 vmulss xmm1,xmm0,DWORD PTR [rsi+0x8]
400ecd: c5 fa 11 4e 08 vmovss DWORD PTR [rsi+0x8],xmm1
400ed2: 48 83 f8 03 cmp rax,0x3
400ed6: 0f 84 86 01 00 00 je 401062 <_Z6Serialv+0x1e2>
400edc: c5 fa 59 4e 0c vmulss xmm1,xmm0,DWORD PTR [rsi+0xc]
400ee1: c5 fa 11 4e 0c vmovss DWORD PTR [rsi+0xc],xmm1
400ee6: 48 83 f8 04 cmp rax,0x4
400eea: 0f 84 2d 01 00 00 je 40101d <_Z6Serialv+0x19d>
400ef0: c5 fa 59 4e 10 vmulss xmm1,xmm0,DWORD PTR [rsi+0x10]
400ef5: c5 fa 11 4e 10 vmovss DWORD PTR [rsi+0x10],xmm1
400efa: 48 83 f8 05 cmp rax,0x5
400efe: 0f 84 6f 01 00 00 je 401073 <_Z6Serialv+0x1f3>
400f04: c5 fa 59 4e 14 vmulss xmm1,xmm0,DWORD PTR [rsi+0x14]
400f09: c5 fa 11 4e 14 vmovss DWORD PTR [rsi+0x14],xmm1
400f0e: 48 83 f8 06 cmp rax,0x6
400f12: 0f 84 6c 01 00 00 je 401084 <_Z6Serialv+0x204>
400f18: c5 fa 59 46 18 vmulss xmm0,xmm0,DWORD PTR [rsi+0x18]
400f1d: 41 b9 f9 ff ff 0f mov r9d,0xffffff9
400f23: 41 ba 07 00 00 00 mov r10d,0x7
400f29: c5 fa 11 46 18 vmovss DWORD PTR [rsi+0x18],xmm0
400f2e: 41 b8 00 00 00 10 mov r8d,0x10000000
400f34: c5 fc 28 0d 04 04 00 vmovaps ymm1,YMMWORD PTR [rip+0x404] # 401340 <_IO_stdin_used+0x40>
400f3b: 00
400f3c: 48 8d 0c 86 lea rcx,[rsi+rax*4]
400f40: 31 d2 xor edx,edx
400f42: 49 29 c0 sub r8,rax
400f45: 31 c0 xor eax,eax
400f47: 4c 89 c7 mov rdi,r8
400f4a: 48 c1 ef 03 shr rdi,0x3
400f4e: 66 90 xchg ax,ax
400f50: c5 f4 59 04 01 vmulps ymm0,ymm1,YMMWORD PTR [rcx+rax*1]
400f55: 48 83 c2 01 add rdx,0x1
400f59: c5 fc 29 04 01 vmovaps YMMWORD PTR [rcx+rax*1],ymm0
400f5e: 48 83 c0 20 add rax,0x20
400f62: 48 39 d7 cmp rdi,rdx
400f65: 77 e9 ja 400f50 <_Z6Serialv+0xd0>
400f67: 4c 89 c1 mov rcx,r8
400f6a: 4c 89 ca mov rdx,r9
400f6d: 48 83 e1 f8 and rcx,0xfffffffffffffff8
400f71: 49 8d 04 0a lea rax,[r10+rcx*1]
400f75: 48 29 ca sub rdx,rcx
400f78: 49 39 c8 cmp r8,rcx
400f7b: 0f 84 98 00 00 00 je 401019 <_Z6Serialv+0x199>
400f81: 48 8d 0c 86 lea rcx,[rsi+rax*4]
400f85: c5 fa 10 05 8f 03 00 vmovss xmm0,DWORD PTR [rip+0x38f] # 40131c <_IO_stdin_used+0x1c>
400f8c: 00
400f8d: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400f91: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400f95: 48 8d 48 01 lea rcx,[rax+0x1]
400f99: 48 83 fa 01 cmp rdx,0x1
400f9d: 74 7a je 401019 <_Z6Serialv+0x199>
400f9f: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400fa3: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400fa7: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400fab: 48 8d 48 02 lea rcx,[rax+0x2]
400faf: 48 83 fa 02 cmp rdx,0x2
400fb3: 74 64 je 401019 <_Z6Serialv+0x199>
400fb5: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400fb9: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400fbd: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400fc1: 48 8d 48 03 lea rcx,[rax+0x3]
400fc5: 48 83 fa 03 cmp rdx,0x3
400fc9: 74 4e je 401019 <_Z6Serialv+0x199>
400fcb: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400fcf: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400fd3: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400fd7: 48 8d 48 04 lea rcx,[rax+0x4]
400fdb: 48 83 fa 04 cmp rdx,0x4
400fdf: 74 38 je 401019 <_Z6Serialv+0x199>
400fe1: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400fe5: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400fe9: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400fed: 48 8d 48 05 lea rcx,[rax+0x5]
400ff1: 48 83 fa 05 cmp rdx,0x5
400ff5: 74 22 je 401019 <_Z6Serialv+0x199>
400ff7: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400ffb: 48 83 c0 06 add rax,0x6
400fff: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
401003: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
401007: 48 83 fa 06 cmp rdx,0x6
40100b: 74 0c je 401019 <_Z6Serialv+0x199>
40100d: 48 8d 04 86 lea rax,[rsi+rax*4]
401011: c5 fa 59 00 vmulss xmm0,xmm0,DWORD PTR [rax]
401015: c5 fa 11 00 vmovss DWORD PTR [rax],xmm0
401019: c5 f8 77 vzeroupper
40101c: c3 ret
40101d: 41 ba 04 00 00 00 mov r10d,0x4
401023: 41 b9 fc ff ff 0f mov r9d,0xffffffc
401029: e9 00 ff ff ff jmp 400f2e <_Z6Serialv+0xae>
40102e: 66 90 xchg ax,ax
401030: 41 b9 00 00 00 10 mov r9d,0x10000000
401036: 45 31 d2 xor r10d,r10d
401039: e9 f0 fe ff ff jmp 400f2e <_Z6Serialv+0xae>
40103e: 66 90 xchg ax,ax
401040: 41 b9 ff ff ff 0f mov r9d,0xfffffff
401046: 41 ba 01 00 00 00 mov r10d,0x1
40104c: e9 dd fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401051: 41 ba 02 00 00 00 mov r10d,0x2
401057: 41 b9 fe ff ff 0f mov r9d,0xffffffe
40105d: e9 cc fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401062: 41 ba 03 00 00 00 mov r10d,0x3
401068: 41 b9 fd ff ff 0f mov r9d,0xffffffd
40106e: e9 bb fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401073: 41 ba 05 00 00 00 mov r10d,0x5
401079: 41 b9 fb ff ff 0f mov r9d,0xffffffb
40107f: e9 aa fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401084: 41 ba 06 00 00 00 mov r10d,0x6
40108a: 41 b9 fa ff ff 0f mov r9d,0xffffffa
401090: e9 99 fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401095: 90 nop
401096: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
40109d: 00 00 00
The full code:
#include <iostream>
#include <xmmintrin.h>
#include <immintrin.h>

using namespace std;

/**
 * The vector size
 * 268435456 -> 32*8388608 -> 2^28
 */
#define SIZE 268435456

/**
 * The vector for computations
 */
float *vector;

/**
 * Run AVX Code
 */
void AVX() { ... }

/**
 * Run Compiler optimized version
 */
void Serial() { ... }

/**
 * Create the vector
 */
void create() {
    vector = new float[SIZE];
}

/**
 * Fill the vector with data
 * to be used for validation
 */
void fill() {
    uint_fast32_t loop = 0;
    // Fill the vector
    for ( loop = 0 ; loop < SIZE ; loop++ )
        vector[loop] = 1;
}

/**
 * A validation to ensure the compiler has
 * computed all the vector data
 */
void validation() {
    // The loop variables
    unsigned long loop = 0;
    unsigned long errors = 0;
    unsigned long checks = 0;
    for ( loop = 0 ; loop < SIZE ; loop++ ) {
        // Every element of the vector must be 5
        if ( vector[loop] != 5 ) {
            errors++;
            // To avoid showing too many errors
            if ( errors < 12 )
                std::cout << loop << ": " << vector[loop] << std::endl;
        }
        checks++;
    }
    // The result
    std::cout << "Errors: " << errors << "\nChecks: " << checks << std::endl;
}

int main() {
    // Create the vector
    create();
    // Fill with data
    //fill();
    // The tests
    //Serial();
    AVX();
    /*
     * To ensure that the g++ optimizer has actually executed the loop
     */
    //validation();
}
Multiplying by 5 is so trivial that you should do that on the fly next time you read the array, or fold it into the code that wrote this array. Loading all that data from RAM into the CPU and storing it back again just to multiply by 5.0 is not efficient.
If you can't just fold it into a different pass of your algorithm, try cache-blocking aka loop-tiling to run multiple steps of your algorithm over a part of this array that fits into cache, before moving on to the next cache-sized block.
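As a rough illustration of the loop-tiling idea (the two passes and the tile size are made-up placeholders for whatever your real algorithm does):
#include <algorithm>
#include <cstddef>

// Hypothetical tile size: 32K floats = 128 KiB, small enough to sit in L2 cache.
static const std::size_t BLOCK = 32 * 1024;

// Run every pass over one cache-sized tile before moving to the next tile,
// so each element is pulled from DRAM once instead of once per pass.
void process_tiled(float *v, std::size_t n) {
    for (std::size_t base = 0; base < n; base += BLOCK) {
        std::size_t end = std::min(base + BLOCK, n);
        for (std::size_t i = base; i < end; ++i) v[i] *= 5.0f;   // pass 1
        for (std::size_t i = base; i < end; ++i) v[i] += 1.0f;   // pass 2 (stand-in)
    }
}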
Your scalar code auto-vectorizes to nearly the same inner loop as your manually-vectorized version. Neither one is unrolled at all.
The extra code size in gcc's version is just scalar startup / cleanup so its inner loop can use aligned loads/stores. gcc fully unrolls those loops.
Also note that your manually-vectorized code doesn't handle the case where SIZE is not a multiple of 8. (gcc does handle the cleanup at the end even then, because it doesn't know where the alignment boundary will be.)
clang usually just uses unaligned loads/stores on arrays that it can't prove at compile time are always aligned. gcc's default behaviour is maybe good for large arrays that actually are misaligned at run-time, but a total waste of I-cache and branches for cases where the data is in fact aligned at run time most of the time, or for small arrays where doing a bunch of branching and scalar iterations isn't worth it.
The inner loops are nearly the same. In your manually vectorized version, gcc managed to optimize away the element-by-element copy through f_data and emit what you would get from _mm256_loadu_ps(&vector[loop]), instead of actually copying to a local and then doing a vector load. And same for storing back into vector[], luckily for you.
# top of inner loop in the manually-vectorized version:
400e00: c5 f4 59 00 vmulps ymm0,ymm1,YMMWORD PTR [rax]
400e04: 48 83 c0 20 add rax,0x20
400e08: c5 fc 11 40 e0 vmovups YMMWORD PTR [rax-0x20],ymm0
400e0d: 48 39 c2 cmp rdx,rax
400e10: 75 ee jne 400e00 <_Z3AVXv+0x20>
gcc's inner loop uses a loop counter separate from the pointer, so it has an extra instruction, and it uses an indexed addressing mode. vmulps ymm0,ymm1,YMMWORD PTR [rcx+rax*1] can't stay micro-fused on Haswell, so it will issue as 2 fused-domain uops.
# top of gcc's inner loop:
400f50: c5 f4 59 04 01 vmulps ymm0,ymm1,YMMWORD PTR [rcx+rax*1]
400f55: 48 83 c2 01 add rdx,0x1
400f59: c5 fc 29 04 01 vmovaps YMMWORD PTR [rcx+rax*1],ymm0
400f5e: 48 83 c0 20 add rax,0x20
400f62: 48 39 d7 cmp rdi,rdx
400f65: 77 e9 ja 400f50 <_Z6Serialv+0xd0>
The extra add instruction is another extra uop. This is 6 fused-domain uops (and thus can run at best one iteration per 1.5 cycles, bottlenecked on the front-end).
Your manual version is only 4 fused-domain uops, so it can issue at 1 per clock. It can in theory run that fast if the buffer is hot in L1D cache (or maybe L2), also limited by 1 store per clock.
Of course, since you're running it over a giant buffer, you just bottleneck on memory bandwidth. The minor front-end bottleneck in the auto-vectorized version is a total non-issue. Even an SSE2 version would barely run slower.
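Incidentally, since gcc already turns the f_data dance into plain loads and stores from the array, you could write the manual version that way directly. A minimal sketch, reusing the question's vector and SIZE and assuming SIZE is a multiple of 8:
#include <immintrin.h>
#include <cstddef>

#define SIZE 268435456
extern float *vector;   // the global from the question

// The same inner loop gcc already produced: load straight from the array,
// multiply, store straight back, no intermediate buffer.
void AVX_simple() {
    const __m256 k = _mm256_set1_ps(5.0f);
    for (std::size_t i = 0; i < SIZE; i += 8) {
        __m256 v = _mm256_loadu_ps(&vector[i]);
        _mm256_storeu_ps(&vector[i], _mm256_mul_ps(v, k));
    }
}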
You said something about a Xeon with 16 cores. If you want gcc to auto-parallelize as well as SIMD vectorize, you could use OpenMP. As it is, your code is purely single-threaded.
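A minimal OpenMP sketch of that (an assumption on my part that it fits your build; add -fopenmp to your compile flags):
#include <cstddef>

#define SIZE 268435456
extern float *vector;   // the global from the question

// Threads split the array between them, and gcc can still SIMD-vectorize
// each thread's chunk.
void Serial_omp() {
    #pragma omp parallel for simd
    for (long i = 0; i < (long)SIZE; ++i)
        vector[i] *= 5.0f;
}
Keep in mind that for a memory-bound pass like this one, extra threads only help to the extent that a single core cannot saturate memory bandwidth on its own.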
I read some old articles about the thread-safety problems of local-scope static variable initialization:
"C++ scoped static initialization is not thread-safe", back in 2004, and
"Function Static Variables in Multi-Threaded Environments", from 2006.
Then I put together an example to check my compiler, gcc 4.4.7:
int calcSomething() { return 0; }

void foo() {
    static int x = calcSomething();
}

int main() {
    foo();
    return 0;
}
The output from objdump shows:
000000000040061a <_Z3foov>:
40061a: 55 push %rbp
40061b: 48 89 e5 mov %rsp,%rbp
40061e: b8 d0 0a 60 00 mov $0x600ad0,%eax
400623: 0f b6 00 movzbl (%rax),%eax
400626: 84 c0 test %al,%al
400628: 75 28 jne 400652 <_Z3foov+0x38>
40062a: bf d0 0a 60 00 mov $0x600ad0,%edi
40062f: e8 bc fe ff ff callq 4004f0 <__cxa_guard_acquire@plt>
400634: 85 c0 test %eax,%eax
400636: 0f 95 c0 setne %al
400639: 84 c0 test %al,%al
40063b: 74 15 je 400652 <_Z3foov+0x38>
40063d: e8 d2 ff ff ff callq 400614 <_Z13calcSomethingv>
400642: 89 05 90 04 20 00 mov %eax,0x200490(%rip) # 600ad8 <_ZZ3foovE1x>
400648: bf d0 0a 60 00 mov $0x600ad0,%edi
40064d: e8 be fe ff ff callq 400510 <__cxa_guard_release@plt>
400652: c9 leaveq
400653: c3 retq
Unfortunately, my knowledge of assembly is too limited to tell what the compiler is doing here. Can anyone shed some light on what this assembly code does, and whether it is still not thread-safe? I would really appreciate some pseudocode showing what gcc is doing here.
EDIT-1:
As Jerry commented, I enabled optimization with -O2; the assembly code is:
0000000000400620 <_Z3foov>:
400620: 48 83 ec 08 sub $0x8,%rsp
400624: 80 3d 85 04 20 00 00 cmpb $0x0,0x200485(%rip) # 600ab0 <_ZGVZ3foovE1x>
40062b: 74 0b je 400638 <_Z3foov+0x18>
40062d: 48 83 c4 08 add $0x8,%rsp
400631: c3 retq
400632: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400638: bf b0 0a 60 00 mov $0x600ab0,%edi
40063d: e8 9e fe ff ff callq 4004e0 <__cxa_guard_acquire@plt>
400642: 85 c0 test %eax,%eax
400644: 74 e7 je 40062d <_Z3foov+0xd>
400646: c7 05 68 04 20 00 00 movl $0x0,0x200468(%rip) # 600ab8 <_ZZ3foovE1x>
40064d: 00 00 00
400650: bf b0 0a 60 00 mov $0x600ab0,%edi
400655: 48 83 c4 08 add $0x8,%rsp
400659: e9 a2 fe ff ff jmpq 400500 <__cxa_guard_release@plt>
40065e: 66 90 xchg %ax,%ax
Yes, this is the thread-safe initialization guard. In pseudocode (for the un-optimized case) it's something like:
if (flag_val() != 0) goto done;
if (guard_acquire() == 0) goto done;
x = calcSomething();
guard_release_and_set_flag();
// Note releasing the guard lock causes later
// calls to flag_val() to return non-zero.
done: return
The flag_val() is really a non-blocking check, apparently for efficiency to avoid calling the acquire primitive unless necessary. The flag must be set by guard_release as shown. The acquire seems to be the synchronized call to grab the lock. Only one thread will get a true value back and perform the initialization. After it releases the lock, the non-zero flag prevents any further touches of the lock.
Another interesting tidbit is that the guard data structure is 8 bytes away from the value of x itself in static memory.
Those familiar with the singleton pattern in languages with built-in threads e.g. Java will recognize this!
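In rough C++ terms (illustration only, with a plain flag and a mutex standing in for the __cxa_guard_* runtime calls), the generated code behaves like this:
#include <mutex>

int calcSomething() { return 0; }

static bool x_initialized = false;   // the "guard byte" checked on the fast path
static std::mutex x_guard;           // stand-in for __cxa_guard_acquire/release
static int x;

void foo() {
    if (!x_initialized) {                        // cheap non-blocking check
        std::lock_guard<std::mutex> lock(x_guard);
        if (!x_initialized) {                    // re-check under the lock
            x = calcSomething();                 // exactly one thread runs this
            x_initialized = true;                // done by __cxa_guard_release
        }
    }
}
(In real code the flag would need to be a std::atomic<bool> with acquire/release ordering for this to be strictly correct; the ABI's guard functions provide exactly those ordering guarantees.)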
Addition
A bit more time now, so in a bit more detail:
000000000040061a <_Z3foov>:
; Prepare to access stack variables (never used in un-optimized code).
40061a: 55 push %rbp
40061b: 48 89 e5 mov %rsp,%rbp
; Test a byte 8 away from the static int x. This is apparently an "initialized" flag.
40061e: b8 d0 0a 60 00 mov $0x600ad0,%eax
400623: 0f b6 00 movzbl (%rax),%eax
400626: 84 c0 test %al,%al
; Go to the end of the function if the byte was non-zero.
400628: 75 28 jne 400652 <_Z3foov+0x38>
; Load the same byte address in di: the argument for the call to
; acquire the guard lock.
40062a: bf d0 0a 60 00 mov $0x600ad0,%edi
40062f: e8 bc fe ff ff callq 4004f0 <__cxa_guard_acquire@plt>
; Test the return value. Go to the end of the function if it is zero (non-optimized code).
400634: 85 c0 test %eax,%eax
400636: 0f 95 c0 setne %al
400639: 84 c0 test %al,%al
40063b: 74 15 je 400652 <_Z3foov+0x38>
; Call the user's initialization function and move result into x.
40063d: e8 d2 ff ff ff callq 400614 <_Z13calcSomethingv>
400642: 89 05 90 04 20 00 mov %eax,0x200490(%rip) # 600ad8 <_ZZ3foovE1x>
; Load the guard byte's address again and call the release routine.
; This must set the flag to non-zero.
400648: bf d0 0a 60 00 mov $0x600ad0,%edi
40064d: e8 be fe ff ff callq 400510 <__cxa_guard_release@plt>
; Restore state and return.
400652: c9 leaveq
400653: c3 retq
This listing, although for the LLVM compiler rather than g++ (are you running OS X? OS X aliases g++ to LLVM), agrees with the guesswork above. The set_initialized routine is setting a flag value in guard_release.
Is there any difference in performance or memory usage between a #define, a static const and an enum constant? There might be, because of the inlining of #define constants.
I understand the answer may be compiler-dependent, so let's assume GCC.
There are already similar questions about C and about C++, but they are more about usage aspects.
The compiler treats them the same given basic optimization.
It's fairly easy to check; consider the following C code:
#include <stdio.h>

#define a 1
static const int b = 2;
typedef enum { FOUR = 4 } enum_t;

int main() {
    enum_t c = FOUR;
    printf("%d\n", a);
    printf("%d\n", b);
    printf("%d\n", c);
    return 0;
}
compiled with gcc -O3:
0000000000400410 <main>:
400410: 48 83 ec 08 sub $0x8,%rsp
400414: be 01 00 00 00 mov $0x1,%esi
400419: bf 2c 06 40 00 mov $0x40062c,%edi
40041e: 31 c0 xor %eax,%eax
400420: e8 cb ff ff ff callq 4003f0 <printf@plt>
400425: be 02 00 00 00 mov $0x2,%esi
40042a: bf 2c 06 40 00 mov $0x40062c,%edi
40042f: 31 c0 xor %eax,%eax
400431: e8 ba ff ff ff callq 4003f0 <printf@plt>
400436: be 04 00 00 00 mov $0x4,%esi
40043b: bf 2c 06 40 00 mov $0x40062c,%edi
400440: 31 c0 xor %eax,%eax
400442: e8 a9 ff ff ff callq 4003f0 <printf@plt>
Absolutely identical assembly code for all three, hence the exact same performance and memory usage.
Edit: As Damon stated in the comments, there may be some corner cases, such as complicated non-literals, but that goes a bit beyond the question.
When used as a constant expression there will be no difference in performance. If used as an lvalue, the static const will need to be defined (memory) and accessed (cpu).
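A small illustration of that lvalue case (the names are hypothetical, mirroring the example above): taking the address of the static const forces it to occupy storage, which neither the macro nor the enumerator can.
#define A 1
static const int b = 2;
enum { FOUR = 4 };

const int *p = &b;        // fine: b must now exist as an object in memory
// const int *q = &A;     // won't compile: A expands to the literal 1
// const int *r = &FOUR;  // won't compile: an enumerator is not an lvalue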
I'm not sure whether I've found a bug in g++ (4.4.1-4ubuntu9), or if I'm doing
something wrong. What I believe I'm seeing is a bug introduced by enabling
optimization with g++ -O2. I've tried to distill the code down to just the
relevant parts.
When optimization is enabled, I have an ASSERT which is failing. When
optimization is disabled, the same ASSERT does not fail. I think I've tracked
it down to the optimization of one function and its callers.
The System
Language: C++
Ubuntu 9.10
g++-4.4.real (Ubuntu 4.4.1-4ubuntu9) 4.4.1
Linux 2.6.31-22-server x86_64
Optimization Enabled
Object compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O2 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
And here is the relevant code from objdump -dg file.o.
00000000000018b0 <helper_function>:
;; This function takes two parameters:
;; pointer to int: %rdi
;; pointer to int[]: %rsi
18b0: 0f b6 07 movzbl (%rdi),%eax
18b3: 83 f8 12 cmp $0x12,%eax
18b6: 74 60 je 1918 <helper_function+0x68>
18b8: 83 f8 17 cmp $0x17,%eax
18bb: 74 5b je 1918 <helper_function+0x68>
...
1918: c7 06 32 00 00 00 movl $0x32,(%rsi)
191e: 66 90 xchg %ax,%ax
1920: c3 retq
0000000000005290 <buggy_invoker>:
... snip ...
52a0: 48 81 ec c8 01 00 00 sub $0x1c8,%rsp
52a7: 48 8d 84 24 a0 01 00 lea 0x1a0(%rsp),%rax
52ae: 00
52af: 48 c7 84 24 a0 01 00 movq $0x0,0x1a0(%rsp)
52b6: 00 00 00 00 00
52bb: 48 c7 84 24 a8 01 00 movq $0x0,0x1a8(%rsp)
52c2: 00 00 00 00 00
52c7: c7 84 24 b0 01 00 00 movl $0x0,0x1b0(%rsp)
52ce: 00 00 00 00
52d2: 4c 8d 7c 24 20 lea 0x20(%rsp),%r15
52d7: 48 89 c6 mov %rax,%rsi
52da: 48 89 44 24 08 mov %rax,0x8(%rsp)
;; ***** BUG HERE *****
;; Pointer to int[] loaded into %rsi
;; But where is %rdi populated?
52df: e8 cc c5 ff ff callq 18b0 <helper_function>
0000000000005494 <perfectly_fine_invoker>:
5494: 48 83 ec 20 sub $0x20,%rsp
5498: 0f ae f0 mfence
549b: 48 8d 7c 24 30 lea 0x30(%rsp),%rdi
54a0: 48 89 e6 mov %rsp,%rsi
54a3: 48 c7 04 24 00 00 00 movq $0x0,(%rsp)
54aa: 00
54ab: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
54b2: 00 00
54b4: c7 44 24 10 00 00 00 movl $0x0,0x10(%rsp)
54bb: 00
;; Non buggy invocation here: both %rdi and %rsi loaded correctly.
54bc: e8 ef c3 ff ff callq 18b0 <helper_function>
Optimization Disabled
Now compiled with:
g++ -DHAVE_CONFIG_H -I. -fPIC -g -O0 -MT file.o -MD -MP -MF .deps/file.Tpo -c -o file.o file.cpp
0000000000008d27 <helper_function>:
;; Still the same parameters here, but it looks a little different.
... snip ...
8d2b: 48 89 7d e8 mov %rdi,-0x18(%rbp)
8d2f: 48 89 75 e0 mov %rsi,-0x20(%rbp)
8d33: 48 8b 45 e8 mov -0x18(%rbp),%rax
8d37: 0f b6 00 movzbl (%rax),%eax
8d3a: 0f b6 c0 movzbl %al,%eax
8d3d: 89 45 fc mov %eax,-0x4(%rbp)
8d40: 8b 45 fc mov -0x4(%rbp),%eax
8d43: 83 f8 17 cmp $0x17,%eax
8d46: 74 40 je 8d88 <helper_function+0x61>
...
000000000000948a <buggy_invoker>:
948a: 55 push %rbp
948b: 48 89 e5 mov %rsp,%rbp
948e: 41 54 push %r12
9490: 53 push %rbx
9491: 48 81 ec c0 01 00 00 sub $0x1c0,%rsp
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
949f: 48 89 b5 30 fe ff ff mov %rsi,-0x1d0(%rbp)
94a6: 48 c7 45 c0 00 00 00 movq $0x0,-0x40(%rbp)
94ad: 00
94ae: 48 c7 45 c8 00 00 00 movq $0x0,-0x38(%rbp)
94b5: 00
94b6: c7 45 d0 00 00 00 00 movl $0x0,-0x30(%rbp)
94bd: 48 8d 55 c0 lea -0x40(%rbp),%rdx
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
;; ***** NOT BUGGY HERE *****
;; Now, without optimization, both %rdi and %rsi loaded correctly.
94ce: e8 54 f8 ff ff callq 8d27 <helper_function>
0000000000008eec <different_perfectly_fine_invoker>:
8eec: 55 push %rbp
8eed: 48 89 e5 mov %rsp,%rbp
8ef0: 48 83 ec 30 sub $0x30,%rsp
8ef4: 48 89 7d d8 mov %rdi,-0x28(%rbp)
8ef8: 48 c7 45 e0 00 00 00 movq $0x0,-0x20(%rbp)
8eff: 00
8f00: 48 c7 45 e8 00 00 00 movq $0x0,-0x18(%rbp)
8f07: 00
8f08: c7 45 f0 00 00 00 00 movl $0x0,-0x10(%rbp)
8f0f: 48 8d 55 e0 lea -0x20(%rbp),%rdx
8f13: 48 8b 45 d8 mov -0x28(%rbp),%rax
8f17: 48 89 d6 mov %rdx,%rsi
8f1a: 48 89 c7 mov %rax,%rdi
;; Another example of non-optimized call to that function.
8f1d: e8 05 fe ff ff callq 8d27 <helper_function>
The Original C++ Code
This is a sanitized version of the original C++. I've just changed some names
and removed irrelevant code. Forgive my paranoia, I just don't want to expose
too much code from unpublished and unreleased work :-).
static void helper_function(my_struct_t *e, int *outArr)
{
    unsigned char event_type = e->header.type;
    if (event_type == event_A || event_type == event_B) {
        outArr[0] = action_one;
    } else if (event_type == event_C) {
        outArr[0] = action_one;
        outArr[1] = action_two;
    } else if (...) { ... }
}

static void buggy_invoker(my_struct_t *e, predicate_t pred)
{
    // MAX_ACTIONS is #defined to 5
    int action_array[MAX_ACTIONS] = {0};
    helper_function(e, action_array);
    ...
}

static int has_any_actions(my_struct_t *e)
{
    int actions[MAX_ACTIONS] = {0};
    helper_function(e, actions);
    return actions[0] != 0;
}

// *** ENTRY POINT to this code is this function (note not static).
void perfectly_fine_invoker(my_struct_t e, predicate_t pred)
{
    memfence();
    if (has_any_actions(&e)) {
        buggy_invoker(&e, pred);
    }
    ...
}
If you think I've obfuscated or eliminated too much, let me know. Users of
this code call 'perfectly_fine_invoker'. With optimization, g++ optimizes the
'has_any_actions' function away into a direct call to 'helper_function', which
you can see in the assembly.
The Question
So, my question is, does it look like a buggy optimization to anyone else?
If it would be helpful, I could post a sanitized version of the original C++ code.
This is my first posting to Stack Overflow, so please let me know if I can do
anything to make the question clearer, or provide any additional information.
The Answer
Edit (several days after the fact):
I accepted an answer below to my question -- it was not an optimization bug in g++, I was just looking at the assembly code wrong.
However, for whoever may be viewing this question in the future, I've found the answer. I did some reading on undefined behavior in C ( http://blog.regehr.org/archives/213 and http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html ) and some of the descriptions of the compiler optimizing away functions with undefined behavior seemed eerily familiar.
I added some NULL-pointer checks to the function 'helper_function' and lo and behold... bug goes away. I should have had the NULL-pointer checks to begin with, but apparently not having them allowed g++ to do whatever it wanted (in my case, optimize away the call).
Hope this information helps someone down the road.
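For illustration, the kind of check described above, applied to the sanitized helper_function from the question (my reconstruction, not the actual fix; names as in the sanitized listing):
static void helper_function(my_struct_t *e, int *outArr)
{
    // Bail out instead of dereferencing a null pointer. Without a check,
    // a path that passes null is undefined behaviour and the optimizer is
    // free to assume it never happens.
    if (!e || !outArr)
        return;
    unsigned char event_type = e->header.type;
    if (event_type == event_A || event_type == event_B) {
        outArr[0] = action_one;
    } else if (event_type == event_C) {
        outArr[0] = action_one;
        outArr[1] = action_two;
    }
}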
I think you are looking at the wrong thing. I imagine the compiler noticed that your function is short and doesn't touch the %rdi register, so it just leaves it alone (you have the same variable as the first parameter, which I guess is what is placed in %rdi; see page 21 here: http://www.x86-64.org/documentation/abi.pdf).
If you look at the unoptimized version it saves the %rdi register on this line
9498: 48 89 bd 38 fe ff ff mov %rdi,-0x1c8(%rbp)
...and then later, just before calling helper_function, it moves the saved value into %rax, which is then moved into %rdi.
94c1: 48 8b 85 38 fe ff ff mov -0x1c8(%rbp),%rax
94c8: 48 89 d6 mov %rdx,%rsi
94cb: 48 89 c7 mov %rax,%rdi
When optimizing, the compiler just gets rid of all that moving back and forth.