Why is an inlined function slower than a function pointer? (C++)

Consider the following code:
typedef void (*Fn)();
volatile long sum = 0;
inline void accu() {
    sum += 4;
}
static const Fn map[4] = {&accu, &accu, &accu, &accu};
int main(int argc, char** argv) {
    static const long N = 10000000L;
    if (argc == 1)
    {
        for (long i = 0; i < N; i++)
        {
            accu();
            accu();
            accu();
            accu();
        }
    }
    else
    {
        for (long i = 0; i < N; i++)
        {
            for (int j = 0; j < 4; j++)
                (*map[j])();
        }
    }
}
When I compiled it with:
g++ -O3 test.cpp
I expected the first branch to run faster because the compiler could inline the calls to accu, while the second branch cannot be inlined because accu is called through a function pointer stored in an array.
But the results surprised me:
time ./a.out
real 0m0.108s
user 0m0.104s
sys 0m0.000s
time ./a.out 1
real 0m0.095s
user 0m0.088s
sys 0m0.004s
I don't understand why, so I did an objdump:
objdump -DStTrR a.out > a.s
and the disassembly doesn't seem to explain the performance result I got:
8048300 <main>:
8048300: 55 push %ebp
8048301: 89 e5 mov %esp,%ebp
8048303: 53 push %ebx
8048304: bb 80 96 98 00 mov $0x989680,%ebx
8048309: 83 e4 f0 and $0xfffffff0,%esp
804830c: 83 7d 08 01 cmpl $0x1,0x8(%ebp)
8048310: 74 27 je 8048339 <main+0x39>
8048312: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
8048318: e8 23 01 00 00 call 8048440 <_Z4accuv>
804831d: e8 1e 01 00 00 call 8048440 <_Z4accuv>
8048322: e8 19 01 00 00 call 8048440 <_Z4accuv>
8048327: e8 14 01 00 00 call 8048440 <_Z4accuv>
804832c: 83 eb 01 sub $0x1,%ebx
804832f: 90 nop
8048330: 75 e6 jne 8048318 <main+0x18>
8048332: 31 c0 xor %eax,%eax
8048334: 8b 5d fc mov -0x4(%ebp),%ebx
8048337: c9 leave
8048338: c3 ret
8048339: b8 80 96 98 00 mov $0x989680,%eax
804833e: 66 90 xchg %ax,%ax
8048340: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048346: 83 c2 04 add $0x4,%edx
8048349: 89 15 18 a0 04 08 mov %edx,0x804a018
804834f: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048355: 83 c2 04 add $0x4,%edx
8048358: 89 15 18 a0 04 08 mov %edx,0x804a018
804835e: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048364: 83 c2 04 add $0x4,%edx
8048367: 89 15 18 a0 04 08 mov %edx,0x804a018
804836d: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048373: 83 c2 04 add $0x4,%edx
8048376: 83 e8 01 sub $0x1,%eax
8048379: 89 15 18 a0 04 08 mov %edx,0x804a018
804837f: 75 bf jne 8048340 <main+0x40>
8048381: eb af jmp 8048332 <main+0x32>
8048383: 90 nop
...
8048440 <_Z4accuv>:
8048440: a1 18 a0 04 08 mov 0x804a018,%eax
8048445: 83 c0 04 add $0x4,%eax
8048448: a3 18 a0 04 08 mov %eax,0x804a018
804844d: c3 ret
804844e: 90 nop
804844f: 90 nop
It seems the direct-call branch is clearly doing less work than the function-pointer branch.
But why does the function-pointer branch run faster than the direct call?
And in case you're wondering about my use of "time" for measurement: I also measured with clock_gettime and got similar results.
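For reference, the clock_gettime measurement was a simple monotonic-clock harness along these lines (a sketch, not the exact code I ran):
#include <ctime>
#include <cstdio>
// Returns the elapsed seconds between two timespec samples.
static double seconds_between(const timespec& t0, const timespec& t1) {
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
void time_region() {
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    // ... the loop under test goes here ...
    clock_gettime(CLOCK_MONOTONIC, &t1);
    std::printf("%.6f s\n", seconds_between(t0, t1));
}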

It is not completely true that the second branch cannot be inlined. In fact, all the function pointers stored in the array are known at compile time, so the compiler can substitute the indirect function calls with direct calls (and it does so). In theory it could go further and inline them, in which case the two branches would be identical, but this particular compiler is not smart enough to do so.
As a result, the first branch is optimized "better", with one exception: the compiler is not allowed to optimize away accesses to the volatile variable sum. As you can see from the disassembled code, this produces store instructions immediately followed by load instructions that depend on them:
mov %edx,0x804a018
mov 0x804a018,%edx
Intel's Software Optimization Manual (section 3.6.5.2) does not recommend arranging instructions like this:
... if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.
The second branch avoids this problem because the additional call/return instructions sit between the store and the load, so it performs better.
Similar improvements can be made to the first branch if we add some (not very expensive) calculations in between:
long x1 = 0;
for (long i = 0; i < N; i++)
{
    x1 ^= i<<8;
    accu();
    x1 ^= i<<1;
    accu();
    x1 ^= i<<2;
    accu();
    x1 ^= i<<4;
    accu();
}
sum += x1;

Related

Which is better: returning tuple or passing arguments to function as references?

I created code where I have two functions, returnValues and returnValuesVoid. One returns a tuple of two values, and the other takes its arguments by reference.
#include <iostream>
#include <tuple>
std::tuple<int, int> returnValues(const int a, const int b) {
    return std::tuple(a, b);
}
void returnValuesVoid(int &a, int &b) {
    a += 100;
    b += 100;
}
int main() {
    auto [x, y] = returnValues(10, 20);
    std::cout << x;
    std::cout << y;
    int a = 10, b = 20;
    returnValuesVoid(a, b);
    std::cout << a;
    std::cout << b;
}
I read about http://en.cppreference.com/w/cpp/language/structured_binding
which can destructure a tuple into auto [x,y] variables.
Is auto [x,y] = returnValues(10,20); better than passing by reference? As far as I know it's slower, because it has to return a tuple object, while the reference version works directly on the original variables passed to the function, so there's no reason to use it except cleaner code.
Since auto [x,y] is available as of C++17, do people use it in production? It looks cleaner than returnValuesVoid, which returns void, but does it have other advantages over passing by reference?
Look at the disassembly (compiled with GCC -O3); it takes more instructions to implement the tuple call:
0000000000000000 <returnValues(int, int)>:
0: 83 c2 64 add $0x64,%edx
3: 83 c6 64 add $0x64,%esi
6: 48 89 f8 mov %rdi,%rax
9: 89 17 mov %edx,(%rdi)
b: 89 77 04 mov %esi,0x4(%rdi)
e: c3 retq
f: 90 nop
0000000000000010 <returnValuesVoid(int&, int&)>:
10: 83 07 64 addl $0x64,(%rdi)
13: 83 06 64 addl $0x64,(%rsi)
16: c3 retq
But there are fewer instructions on the caller's side for the tuple version:
0000000000000000 <callTuple()>:
0: 48 83 ec 18 sub $0x18,%rsp
4: ba 14 00 00 00 mov $0x14,%edx
9: be 0a 00 00 00 mov $0xa,%esi
e: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
13: e8 00 00 00 00 callq 18 <callTuple()+0x18> // call returnValues
18: 8b 74 24 0c mov 0xc(%rsp),%esi
1c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
23: e8 00 00 00 00 callq 28 <callTuple()+0x28> // std::cout::operator<<
28: 8b 74 24 08 mov 0x8(%rsp),%esi
2c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
33: e8 00 00 00 00 callq 38 <callTuple()+0x38> // std::cout::operator<<
38: 48 83 c4 18 add $0x18,%rsp
3c: c3 retq
3d: 0f 1f 00 nopl (%rax)
0000000000000040 <callRef()>:
40: 48 83 ec 18 sub $0x18,%rsp
44: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
49: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
4e: c7 44 24 08 0a 00 00 movl $0xa,0x8(%rsp)
55: 00
56: c7 44 24 0c 14 00 00 movl $0x14,0xc(%rsp)
5d: 00
5e: e8 00 00 00 00 callq 63 <callRef()+0x23> // call returnValuesVoid
63: 8b 74 24 08 mov 0x8(%rsp),%esi
67: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
6e: e8 00 00 00 00 callq 73 <callRef()+0x33> // std::cout::operator<<
73: 8b 74 24 0c mov 0xc(%rsp),%esi
77: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7e: e8 00 00 00 00 callq 83 <callRef()+0x43> // std::cout::operator<<
83: 48 83 c4 18 add $0x18,%rsp
87: c3 retq
I don't think there is any considerable performance difference, but the tuple version is clearer and more readable.
I also tried inlined calls: there is absolutely no difference at all. Both of them generate exactly the same assembly code.
0000000000000000 <callTuple()>:
0: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7: 48 83 ec 08 sub $0x8,%rsp
b: be 6e 00 00 00 mov $0x6e,%esi
10: e8 00 00 00 00 callq 15 <callTuple()+0x15>
15: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
1c: be 78 00 00 00 mov $0x78,%esi
21: 48 83 c4 08 add $0x8,%rsp
25: e9 00 00 00 00 jmpq 2a <callTuple()+0x2a> // TCO, optimized way to call a function and also return
2a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000030 <callRef()>:
30: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
37: 48 83 ec 08 sub $0x8,%rsp
3b: be 6e 00 00 00 mov $0x6e,%esi
40: e8 00 00 00 00 callq 45 <callRef()+0x15>
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
4c: be 78 00 00 00 mov $0x78,%esi
51: 48 83 c4 08 add $0x8,%rsp
55: e9 00 00 00 00 jmpq 5a <callRef()+0x2a> // TCO, optimized way to call a function and also return
Focus on what's more readable and which approach gives the reader better intuition, and keep whatever performance issues you think might arise in the background.
A function that returns a tuple (or a pair, a struct, etc.) is announcing to the reader that it returns something, and that something almost always has a meaning the caller should take into account.
A function that hands its results back through reference parameters may slip past the attention of a tired reader.
So, in general, prefer to return the results in a tuple.
Mike van Dyke pointed to this link:
F.21: To return multiple "out" values, prefer returning a tuple or struct
Reason
A return value is self-documenting as an "output-only"
value. Note that C++ does have multiple return values, by convention
of using a tuple (including pair), possibly with the extra convenience
of tie at the call site.
[...]
Exception
Sometimes, we need to pass an object to a function to manipulate its state. In such cases, passing the object by reference T& is usually the right technique.
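To make the guideline concrete, here is a minimal sketch (the type and function names are made up) of returning a small named struct, which keeps each output self-documenting and still works with C++17 structured bindings:
#include <iostream>
// Hypothetical example: a named struct documents each output,
// unlike std::tuple's positional get<0>/get<1>.
struct MinMax {
    int min;
    int max;
};
MinMax minmax_of(int a, int b) {
    return (a < b) ? MinMax{a, b} : MinMax{b, a};
}
int main() {
    auto [lo, hi] = minmax_of(20, 10); // structured bindings work on structs too
    std::cout << lo << ' ' << hi << '\n'; // prints: 10 20
}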
Using another compiler (VS 2017), the resulting code shows no difference, as the function calls are simply optimized away.
int main() {
00007FF6A9C51E50 sub rsp,28h
auto [x,y] = returnValues(10,20);
std::cout << x ;
00007FF6A9C51E54 mov edx,0Ah
00007FF6A9C51E59 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << y ;
00007FF6A9C51E5E mov edx,14h
00007FF6A9C51E63 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
int a = 10, b = 20;
returnValuesVoid(a, b);
std::cout << a ;
00007FF6A9C51E68 mov edx,6Eh
00007FF6A9C51E6D call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << b ;
00007FF6A9C51E72 mov edx,78h
00007FF6A9C51E77 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
}
00007FF6A9C51E7C xor eax,eax
00007FF6A9C51E7E add rsp,28h
00007FF6A9C51E82 ret
So using clearer code seems to be the obvious choice.
What Zang said is true, but it doesn't tell the whole story. I ran the code provided in the question with chrono to measure time, and I think the answer needs to be edited in light of what happened.
For 1M iterations, the function call via references took 3 ms, while the call via std::tie combined with std::tuple took about 94 ms.
Though the difference looks small in practice, the tuple version still performs slightly slower. Hence, for performance-intensive systems, I suggest passing by reference.
My code:
#include <iostream>
#include <tuple>
#include <chrono>
std::tuple<int, int> returnValues(const int a, const int b)
{
    return std::tuple<int, int>(a, b);
}
void returnValuesVoid(int &a, int &b)
{
    a += 100;
    b += 100;
}
int main()
{
    int a = 10, b = 20;
    auto begin = std::chrono::high_resolution_clock::now();
    int x, y;
    for (int i = 0; i < 1000000; i++)
    {
        std::tie(x, y) = returnValues(a, b);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()) << '\n';
    a = 10;
    b = 20;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000; i++)
    {
        returnValuesVoid(a, b);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()) << '\n';
}
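If the tuple overhead matters in a hot path, note that guideline F.21 equally allows returning a small named struct; on the SysV x86-64 ABI a two-int struct comes back packed in a register, so with optimizations enabled it typically costs about the same as the by-reference version. A sketch (the struct name is made up; measure in your own build):
// A named-struct return as a middle ground between std::tuple
// and out-parameters.
struct TwoInts {
    int a;
    int b;
};
TwoInts returnValuesStruct(int a, int b) {
    return {a + 100, b + 100}; // same work as returnValuesVoid
}
// Usage:
//   auto [x, y] = returnValuesStruct(10, 20); // x == 110, y == 120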

Why is this AVX code slower?

Updated: 19 Aug. 2017, 16:49 UTC
I'm writing AVX code to multiply a vector of 268 million (2^28) float components by a constant; however, I see no difference between my small, hopefully optimized, AVX version and the long scalar version that the compiler optimizes on its own.
Both versions run in 400-410 ms.
Can someone tell me why this is occurring?
And why does the much larger assembly generated by the compiler take almost the same time, even though it's longer?
It's an important question, because if small computations like this multiplication gain nothing, there is no point in writing the code by hand on an Intel Core CPU. Perhaps it would pay off on an Intel Xeon (with 16 cores) or for more complex computations.
I'm compiling with G++ with parameters:
g++ -O3 -mtune=native -march=native -mavx -g3 -Wall -c -fmessage-length=0 -MMD -MP -MF"src/Test AVX.d" -MT"src/Test\ AVX.d" -o "src/Test AVX.o" "../src/Test AVX.cpp"
My CPU is an Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz.
Here is the AVX code:
/**
 * Run AVX Code
 */
void AVX() {
    // Loop control
    uint_fast32_t loop = 0;
    // The constant
    __m256 _const = _mm256_set1_ps(5.0f);
    // The register for multiplication
    __m256 _ymm0 = _mm256_setzero_ps();
    // A "buffer" between the vector and the YMM0 register
    float f_data[8];
    // The main loop
    for ( loop = 0 ; loop < SIZE ; loop = loop + 8 ) {
        // Load into the buffer
        f_data[0] = vector[loop];
        f_data[1] = vector[loop+1];
        f_data[2] = vector[loop+2];
        f_data[3] = vector[loop+3];
        f_data[4] = vector[loop+4];
        f_data[5] = vector[loop+5];
        f_data[6] = vector[loop+6];
        f_data[7] = vector[loop+7];
        /*
         * I tried to use pointers instead of copying
         * the data, but the software crashed
         *
         * float **f_data;
         * f_data = new float*[8];
         *
         * f_data[0] = &vector[loop];
         * ...
         */
        // Load into the XMM and YMM registers
        _ymm0 = _mm256_load_ps(f_data);
        // Do the multiplication
        _ymm0 = _mm256_mul_ps(_ymm0, _const);
        // Copy the results from the register to the "buffer"
        _mm256_store_ps(f_data, _ymm0);
        // Copy from the "buffer" back to the vector
        vector[loop] = f_data[0];
        vector[loop+1] = f_data[1];
        vector[loop+2] = f_data[2];
        vector[loop+3] = f_data[3];
        vector[loop+4] = f_data[4];
        vector[loop+5] = f_data[5];
        vector[loop+6] = f_data[6];
        vector[loop+7] = f_data[7];
    }
}
The AVX code, disassembled:
0000000000400de0 <_Z3AVXv>:
400de0: 48 8b 05 b1 13 20 00 mov rax,QWORD PTR [rip+0x2013b1] # 602198 <vector>
400de7: c5 fc 28 0d 71 06 00 vmovaps ymm1,YMMWORD PTR [rip+0x671] # 401460 <_IO_stdin_used+0x40>
400dee: 00
400def: 48 8d 90 00 00 00 40 lea rdx,[rax+0x40000000]
400df6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
400dfd: 00 00 00
400e00: c5 f4 59 00 vmulps ymm0,ymm1,YMMWORD PTR [rax]
400e04: 48 83 c0 20 add rax,0x20
400e08: c5 fc 11 40 e0 vmovups YMMWORD PTR [rax-0x20],ymm0
400e0d: 48 39 c2 cmp rdx,rax
400e10: 75 ee jne 400e00 <_Z3AVXv+0x20>
400e12: c5 f8 77 vzeroupper
400e15: c3 ret
400e16: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
400e1d: 00 00 00
The Serial Version:
/**
 * Run the compiler-optimized version
 */
void Serial() {
    uint_fast32_t loop;
    // Do the multiplication
    for ( loop = 0 ; loop < SIZE ; loop++ )
        vector[loop] *= 5;
}
The serial code, disassembled:
It's much larger, moves the data around more times, and takes almost the same time. How is that possible?
0000000000400e80 <_Z6Serialv>:
400e80: 48 8b 35 11 13 20 00 mov rsi,QWORD PTR [rip+0x201311] # 602198 <vector>
400e87: 48 89 f0 mov rax,rsi
400e8a: 48 c1 e8 02 shr rax,0x2
400e8e: 48 f7 d8 neg rax
400e91: 83 e0 07 and eax,0x7
400e94: 0f 84 96 01 00 00 je 401030 <_Z6Serialv+0x1b0>
400e9a: c5 fa 10 05 7a 04 00 vmovss xmm0,DWORD PTR [rip+0x47a] # 40131c <_IO_stdin_used+0x1c>
400ea1: 00
400ea2: c5 fa 59 0e vmulss xmm1,xmm0,DWORD PTR [rsi]
400ea6: c5 fa 11 0e vmovss DWORD PTR [rsi],xmm1
400eaa: 48 83 f8 01 cmp rax,0x1
400eae: 0f 84 8c 01 00 00 je 401040 <_Z6Serialv+0x1c0>
400eb4: c5 fa 59 4e 04 vmulss xmm1,xmm0,DWORD PTR [rsi+0x4]
400eb9: c5 fa 11 4e 04 vmovss DWORD PTR [rsi+0x4],xmm1
400ebe: 48 83 f8 02 cmp rax,0x2
400ec2: 0f 84 89 01 00 00 je 401051 <_Z6Serialv+0x1d1>
400ec8: c5 fa 59 4e 08 vmulss xmm1,xmm0,DWORD PTR [rsi+0x8]
400ecd: c5 fa 11 4e 08 vmovss DWORD PTR [rsi+0x8],xmm1
400ed2: 48 83 f8 03 cmp rax,0x3
400ed6: 0f 84 86 01 00 00 je 401062 <_Z6Serialv+0x1e2>
400edc: c5 fa 59 4e 0c vmulss xmm1,xmm0,DWORD PTR [rsi+0xc]
400ee1: c5 fa 11 4e 0c vmovss DWORD PTR [rsi+0xc],xmm1
400ee6: 48 83 f8 04 cmp rax,0x4
400eea: 0f 84 2d 01 00 00 je 40101d <_Z6Serialv+0x19d>
400ef0: c5 fa 59 4e 10 vmulss xmm1,xmm0,DWORD PTR [rsi+0x10]
400ef5: c5 fa 11 4e 10 vmovss DWORD PTR [rsi+0x10],xmm1
400efa: 48 83 f8 05 cmp rax,0x5
400efe: 0f 84 6f 01 00 00 je 401073 <_Z6Serialv+0x1f3>
400f04: c5 fa 59 4e 14 vmulss xmm1,xmm0,DWORD PTR [rsi+0x14]
400f09: c5 fa 11 4e 14 vmovss DWORD PTR [rsi+0x14],xmm1
400f0e: 48 83 f8 06 cmp rax,0x6
400f12: 0f 84 6c 01 00 00 je 401084 <_Z6Serialv+0x204>
400f18: c5 fa 59 46 18 vmulss xmm0,xmm0,DWORD PTR [rsi+0x18]
400f1d: 41 b9 f9 ff ff 0f mov r9d,0xffffff9
400f23: 41 ba 07 00 00 00 mov r10d,0x7
400f29: c5 fa 11 46 18 vmovss DWORD PTR [rsi+0x18],xmm0
400f2e: 41 b8 00 00 00 10 mov r8d,0x10000000
400f34: c5 fc 28 0d 04 04 00 vmovaps ymm1,YMMWORD PTR [rip+0x404] # 401340 <_IO_stdin_used+0x40>
400f3b: 00
400f3c: 48 8d 0c 86 lea rcx,[rsi+rax*4]
400f40: 31 d2 xor edx,edx
400f42: 49 29 c0 sub r8,rax
400f45: 31 c0 xor eax,eax
400f47: 4c 89 c7 mov rdi,r8
400f4a: 48 c1 ef 03 shr rdi,0x3
400f4e: 66 90 xchg ax,ax
400f50: c5 f4 59 04 01 vmulps ymm0,ymm1,YMMWORD PTR [rcx+rax*1]
400f55: 48 83 c2 01 add rdx,0x1
400f59: c5 fc 29 04 01 vmovaps YMMWORD PTR [rcx+rax*1],ymm0
400f5e: 48 83 c0 20 add rax,0x20
400f62: 48 39 d7 cmp rdi,rdx
400f65: 77 e9 ja 400f50 <_Z6Serialv+0xd0>
400f67: 4c 89 c1 mov rcx,r8
400f6a: 4c 89 ca mov rdx,r9
400f6d: 48 83 e1 f8 and rcx,0xfffffffffffffff8
400f71: 49 8d 04 0a lea rax,[r10+rcx*1]
400f75: 48 29 ca sub rdx,rcx
400f78: 49 39 c8 cmp r8,rcx
400f7b: 0f 84 98 00 00 00 je 401019 <_Z6Serialv+0x199>
400f81: 48 8d 0c 86 lea rcx,[rsi+rax*4]
400f85: c5 fa 10 05 8f 03 00 vmovss xmm0,DWORD PTR [rip+0x38f] # 40131c <_IO_stdin_used+0x1c>
400f8c: 00
400f8d: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400f91: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400f95: 48 8d 48 01 lea rcx,[rax+0x1]
400f99: 48 83 fa 01 cmp rdx,0x1
400f9d: 74 7a je 401019 <_Z6Serialv+0x199>
400f9f: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400fa3: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400fa7: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400fab: 48 8d 48 02 lea rcx,[rax+0x2]
400faf: 48 83 fa 02 cmp rdx,0x2
400fb3: 74 64 je 401019 <_Z6Serialv+0x199>
400fb5: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400fb9: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400fbd: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400fc1: 48 8d 48 03 lea rcx,[rax+0x3]
400fc5: 48 83 fa 03 cmp rdx,0x3
400fc9: 74 4e je 401019 <_Z6Serialv+0x199>
400fcb: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400fcf: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400fd3: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400fd7: 48 8d 48 04 lea rcx,[rax+0x4]
400fdb: 48 83 fa 04 cmp rdx,0x4
400fdf: 74 38 je 401019 <_Z6Serialv+0x199>
400fe1: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400fe5: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
400fe9: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
400fed: 48 8d 48 05 lea rcx,[rax+0x5]
400ff1: 48 83 fa 05 cmp rdx,0x5
400ff5: 74 22 je 401019 <_Z6Serialv+0x199>
400ff7: 48 8d 0c 8e lea rcx,[rsi+rcx*4]
400ffb: 48 83 c0 06 add rax,0x6
400fff: c5 fa 59 09 vmulss xmm1,xmm0,DWORD PTR [rcx]
401003: c5 fa 11 09 vmovss DWORD PTR [rcx],xmm1
401007: 48 83 fa 06 cmp rdx,0x6
40100b: 74 0c je 401019 <_Z6Serialv+0x199>
40100d: 48 8d 04 86 lea rax,[rsi+rax*4]
401011: c5 fa 59 00 vmulss xmm0,xmm0,DWORD PTR [rax]
401015: c5 fa 11 00 vmovss DWORD PTR [rax],xmm0
401019: c5 f8 77 vzeroupper
40101c: c3 ret
40101d: 41 ba 04 00 00 00 mov r10d,0x4
401023: 41 b9 fc ff ff 0f mov r9d,0xffffffc
401029: e9 00 ff ff ff jmp 400f2e <_Z6Serialv+0xae>
40102e: 66 90 xchg ax,ax
401030: 41 b9 00 00 00 10 mov r9d,0x10000000
401036: 45 31 d2 xor r10d,r10d
401039: e9 f0 fe ff ff jmp 400f2e <_Z6Serialv+0xae>
40103e: 66 90 xchg ax,ax
401040: 41 b9 ff ff ff 0f mov r9d,0xfffffff
401046: 41 ba 01 00 00 00 mov r10d,0x1
40104c: e9 dd fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401051: 41 ba 02 00 00 00 mov r10d,0x2
401057: 41 b9 fe ff ff 0f mov r9d,0xffffffe
40105d: e9 cc fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401062: 41 ba 03 00 00 00 mov r10d,0x3
401068: 41 b9 fd ff ff 0f mov r9d,0xffffffd
40106e: e9 bb fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401073: 41 ba 05 00 00 00 mov r10d,0x5
401079: 41 b9 fb ff ff 0f mov r9d,0xffffffb
40107f: e9 aa fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401084: 41 ba 06 00 00 00 mov r10d,0x6
40108a: 41 b9 fa ff ff 0f mov r9d,0xffffffa
401090: e9 99 fe ff ff jmp 400f2e <_Z6Serialv+0xae>
401095: 90 nop
401096: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
40109d: 00 00 00
The full code:
#include <iostream>
#include <xmmintrin.h>
#include <immintrin.h>
using namespace std;
/**
 * The vector size
 * 268435456 -> 32*8388608 -> 2^28
 */
#define SIZE 268435456
/**
 * The vector for computations
 */
float *vector;
/**
 * Run AVX Code
 */
void AVX() { ... }
/**
 * Run the compiler-optimized version
 */
void Serial() { ... }
/**
 * Create the vector
 */
void create() {
    vector = new float[SIZE];
}
/**
 * Fill the vector with data
 * to be used for validation
 */
void fill() {
    uint_fast32_t loop = 0;
    // Fill the vector
    for ( loop = 0 ; loop < SIZE ; loop++ )
        vector[loop] = 1;
}
/**
 * A validation to ensure the compiler has
 * computed all the vector data
 */
void validation() {
    // The loop variable
    unsigned long loop = 0;
    unsigned long errors = 0;
    unsigned long checks = 0;
    for ( loop = 0 ; loop < SIZE ; loop++ ) {
        // The whole vector must be 5
        if ( vector[loop] != 5 ) {
            errors++;
            // To avoid showing too many errors
            if ( errors < 12 )
                std::cout << loop << ": " << vector[loop] << std::endl;
        }
        checks++;
    }
    // The result
    std::cout << "Errors: " << errors << "\nChecks: " << checks << std::endl;
}
int main() {
    // Create the vector
    create();
    // Fill with data
    //fill();
    // The tests
    //Serial();
    AVX();
    /*
     * To ensure that the g++ optimization has executed the loop
     */
    //validation();
}
Compiled with:
g++ -O3 -mtune=native -march=native -mavx -g3 -Wall -c -fmessage-length=0 -MMD -MP -MF"src/Test AVX.d" -MT"src/Test\ AVX.d" -o "src/Test AVX.o" "../src/Test AVX.cpp"
Multiplying by 5 is so trivial that you should do it on the fly the next time you read the array, or fold it into the code that wrote the array in the first place. Loading all that data from RAM into the CPU and storing it back again just to multiply by 5.0 is not efficient.
If you can't just fold it into a different pass of your algorithm, try cache-blocking aka loop-tiling to run multiple steps of your algorithm over a part of this array that fits into cache, before moving on to the next cache-sized block.
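For example, a sketch of the tiling shape (BLOCK and the second pass are hypothetical stand-ins; tune them for your real algorithm and cache sizes):
#include <algorithm>
#include <cstddef>
#define SIZE 268435456
extern float *vector; // the question's global array
// Loop tiling: run every pass over one cache-sized chunk before moving
// to the next chunk, so later passes hit data that is still hot in
// L1/L2 instead of re-reading it from DRAM.
void tiled_passes() {
    const std::size_t BLOCK = 4096; // 16 KiB of floats; tune per cache level
    for (std::size_t base = 0; base < SIZE; base += BLOCK) {
        const std::size_t end = std::min<std::size_t>(base + BLOCK, SIZE);
        for (std::size_t i = base; i < end; i++)
            vector[i] *= 5.0f; // pass 1: the question's multiply
        for (std::size_t i = base; i < end; i++)
            vector[i] += 1.0f; // pass 2: hypothetical follow-up step
    }
}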
Your scalar code auto-vectorizes to nearly the same inner loop as your manually-vectorized version. Neither one is unrolled at all.
The extra code size in gcc's version is just scalar startup / cleanup so its inner loop can use aligned loads/stores. gcc fully unrolls those loops.
Also note that your manually-vectorized code doesn't handle the case where SIZE is not a multiple of 8. (gcc does handle the cleanup at the end even then, because it doesn't know where the alignment boundary will be.)
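One way to handle that is a scalar cleanup loop after the vector loop; an untested sketch, using unaligned load/store intrinsics since new float[] isn't guaranteed to be 32-byte aligned the way _mm256_load_ps requires:
#include <immintrin.h>
#include <cstddef>
extern float *vector; // the question's global array
// Process the full 8-float chunks with AVX, then finish the 0-7
// leftover elements with scalar code.
void avx_mul5(std::size_t size) {
    const __m256 fives = _mm256_set1_ps(5.0f);
    std::size_t i = 0;
    for (; i + 8 <= size; i += 8) {
        __m256 v = _mm256_loadu_ps(&vector[i]);
        _mm256_storeu_ps(&vector[i], _mm256_mul_ps(v, fives));
    }
    for (; i < size; i++) // scalar cleanup for the tail
        vector[i] *= 5.0f;
}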
clang usually just uses unaligned loads/stores on arrays that it can't prove at compile time are always aligned. gcc's default behaviour is maybe good for large arrays that actually are misaligned at run-time, but a total waste of I-cache and branches for cases where the data is in fact aligned at run time most of the time, or for small arrays where doing a bunch of branching and scalar iterations isn't worth it.
The inner loops are nearly the same. In your manually vectorized version, gcc managed to optimize away the element-by-element copy through f_data and emit what you would get from _mm256_loadu_ps(&vector[loop]), instead of actually copying to a local and then doing a vector load. And same for storing back into vector[], luckily for you.
# top of inner loop in the manually-vectorized version:
400e00: c5 f4 59 00 vmulps ymm0,ymm1,YMMWORD PTR [rax]
400e04: 48 83 c0 20 add rax,0x20
400e08: c5 fc 11 40 e0 vmovups YMMWORD PTR [rax-0x20],ymm0
400e0d: 48 39 c2 cmp rdx,rax
400e10: 75 ee jne 400e00 <_Z3AVXv+0x20>
gcc's inner loop uses a loop counter separate from the pointer, so it has an extra instruction, and it uses an indexed addressing mode. vmulps ymm0,ymm1,YMMWORD PTR [rcx+rax*1] can't stay micro-fused on Haswell, so it will issue as 2 fused-domain uops.
# top of gcc's inner loop:
400f50: c5 f4 59 04 01 vmulps ymm0,ymm1,YMMWORD PTR [rcx+rax*1]
400f55: 48 83 c2 01 add rdx,0x1
400f59: c5 fc 29 04 01 vmovaps YMMWORD PTR [rcx+rax*1],ymm0
400f5e: 48 83 c0 20 add rax,0x20
400f62: 48 39 d7 cmp rdi,rdx
400f65: 77 e9 ja 400f50 <_Z6Serialv+0xd0>
The extra add instruction is another extra uop. This is 6 fused-domain uops (and thus can run at best one iteration per 1.5 cycles, bottlenecked on the front-end).
Your manual version is only 4 fused-domain uops, so it can issue at 1 per clock. It can in theory run that fast if the buffer is hot in L1D cache (or maybe L2), also limited by 1 store per clock.
Of course, since you're running it over a giant buffer, you just bottleneck on memory bandwidth: the loop reads 1 GiB and writes 1 GiB, so ~400 ms works out to roughly 5 GB/s of traffic (and more at the DRAM level once write-allocate is counted). The minor front-end bottleneck in the auto-vectorized version is a total non-issue. Even an SSE2 version would barely run slower.
You said something about a Xeon with 16 cores. If you want gcc to auto-parallelize as well as SIMD vectorize, you could use OpenMP. As it is, your code is purely single-threaded.
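A sketch of what that could look like (compile with -fopenmp; note this particular loop is memory-bound, so extra threads help mainly on machines with more memory bandwidth than one core can saturate):
#include <cstddef>
#define SIZE 268435456
extern float *vector; // the question's global array
// OpenMP splits the iterations across threads, and the compiler can
// still auto-vectorize each thread's chunk. Requires -fopenmp.
void parallel_mul5() {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)SIZE; i++)
        vector[i] *= 5.0f;
}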

Why is this version of strcmp slower?

I have been experimenting with improving the performance of strcmp under certain conditions. However, I unfortunately cannot even get an implementation of plain vanilla strcmp to perform as well as the library implementation.
I saw a similar question, but the answers there said the difference came from the compiler optimizing away the comparison of string literals. My test does not use string literals.
Here's the implementation (comparisons.cpp):
int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
        if (*a == '\0') return 0;
        a++;
        b++;
    }
    return *b - *a;
}
And here's the test driver (driver.cpp):
#include "comparisons.h"
#include <array>
#include <chrono>
#include <iostream>
void init_string(char* str, int nChars) {
// 10% of strings will be equal, and 90% of strings will have one char different.
// This way, many strings will share long prefixes so strcmp has to exercise a bit.
// Using random strings still shows the custom implementation as slower (just less so).
str[nChars - 1] = '\0';
for (int i = 0; i < nChars - 1; i++)
str[i] = (i % 94) + 32;
if (rand() % 10 != 0)
str[rand() % (nChars - 1)] = 'x';
}
int main(int argc, char** argv) {
srand(1234);
// Pre-generate some strings to compare.
const int kSampleSize = 100;
std::array<char[1024], kSampleSize> strings;
for (int i = 0; i < kSampleSize; i++)
init_string(strings[i], kSampleSize);
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < kSampleSize; i++)
for (int j = 0; j < kSampleSize; j++)
strcmp(strings[i], strings[j]);
auto end = std::chrono::high_resolution_clock::now();
std::cout << "strcmp - " << (end - start).count() << std::endl;
start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < kSampleSize; i++)
for (int j = 0; j < kSampleSize; j++)
strcmp_custom(strings[i], strings[j]);
end = std::chrono::high_resolution_clock::now();
std::cout << "strcmp_custom - " << (end - start).count() << std::endl;
}
And my makefile:
CC=clang++
test: driver.o comparisons.o
$(CC) -o test driver.o comparisons.o
# Compile the test driver with optimizations off.
driver.o: driver.cpp comparisons.h
$(CC) -c -o driver.o -std=c++11 -O0 driver.cpp
# Compile the code being tested separately with optimizations on.
comparisons.o: comparisons.cpp comparisons.h
$(CC) -c -o comparisons.o -std=c++11 -O3 comparisons.cpp
clean:
rm comparisons.o driver.o test
On the advice of this answer, I compiled my comparison function in a separate compilation unit with optimizations on and compiled the driver with optimizations off, but I still see a slowdown of roughly 3x:
strcmp - 154519
strcmp_custom - 506282
I also tried copying the FreeBSD implementation but got similar results.
I'm wondering if my performance measurement is overlooking something. Or is the standard library implementation doing something fancier?
I don't know which standard library you have, but just to give you an idea of how serious C library maintainers are about optimizing the string primitives, the default strcmp used by GNU libc on x86-64 is two thousand lines of hand-optimized assembly language, as of version 2.24. There are separate, also hand-optimized, versions for when the SSSE3 and SSE4.2 instruction set extensions are available. (A fair bit of the complexity in that file appears to be because the same source code is used to generate several other functions; the machine code winds up being "only" 1120 instructions.) 2.24 was released roughly a year ago, and even more work has gone into it since.
They go to this much trouble because it's common for one of the string primitives to be the single hottest function in a profile.
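To give a feel for the techniques involved, here is a minimal sketch (not production code) of the word-at-a-time idea such implementations build on. Real versions must also guarantee that the 8-byte loads never cross into an unmapped page, which this sketch ignores:
#include <cstdint>
#include <cstring>
// True if any byte of v is zero (classic SWAR bit trick).
static bool word_has_zero(std::uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}
// Compares 8 bytes per iteration, falling back to byte-by-byte work
// only when a difference or a terminator appears somewhere in the word.
// Unlike the question's version, this returns the standard strcmp sign.
int strcmp_words(const char* a, const char* b) {
    for (;;) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a, 8); // UNSAFE near the end of a page
        std::memcpy(&wb, b, 8);
        if (wa != wb || word_has_zero(wa)) {
            for (int i = 0;; i++) {
                unsigned char ca = a[i], cb = b[i];
                if (ca != cb) return ca - cb;
                if (ca == '\0') return 0;
            }
        }
        a += 8;
        b += 8;
    }
}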
Excerpts from my disassembly of glibc v2.2.5, x86_64 linux:
0000000000089cd0 <strcmp@@GLIBC_2.2.5>:
89cd0: 48 8b 15 99 a1 33 00 mov 0x33a199(%rip),%rdx # 3c3e70 <_IO_file_jumps@@GLIBC_2.2.5+0x790>
89cd7: 48 8d 05 92 58 01 00 lea 0x15892(%rip),%rax # 9f570 <strerror_l@@GLIBC_2.6+0x200>
89cde: f7 82 b0 00 00 00 10 testl $0x10,0xb0(%rdx)
89ce5: 00 00 00
89ce8: 75 1a jne 89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cea: 48 8d 05 9f 48 0c 00 lea 0xc489f(%rip),%rax # 14e590 <__nss_passwd_lookup@@GLIBC_2.2.5+0x9c30>
89cf1: f7 82 80 00 00 00 00 testl $0x200,0x80(%rdx)
89cf8: 02 00 00
89cfb: 75 07 jne 89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cfd: 48 8d 05 0c 00 00 00 lea 0xc(%rip),%rax # 89d10 <strcmp@@GLIBC_2.2.5+0x40>
89d04: c3 retq
89d05: 90 nop
89d06: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
89d0d: 00 00 00
89d10: 89 f1 mov %esi,%ecx
89d12: 89 f8 mov %edi,%eax
89d14: 48 83 e1 3f and $0x3f,%rcx
89d18: 48 83 e0 3f and $0x3f,%rax
89d1c: 83 f9 30 cmp $0x30,%ecx
89d1f: 77 3f ja 89d60 <strcmp@@GLIBC_2.2.5+0x90>
89d21: 83 f8 30 cmp $0x30,%eax
89d24: 77 3a ja 89d60 <strcmp@@GLIBC_2.2.5+0x90>
89d26: 66 0f 12 0f movlpd (%rdi),%xmm1
89d2a: 66 0f 12 16 movlpd (%rsi),%xmm2
89d2e: 66 0f 16 4f 08 movhpd 0x8(%rdi),%xmm1
89d33: 66 0f 16 56 08 movhpd 0x8(%rsi),%xmm2
89d38: 66 0f ef c0 pxor %xmm0,%xmm0
89d3c: 66 0f 74 c1 pcmpeqb %xmm1,%xmm0
89d40: 66 0f 74 ca pcmpeqb %xmm2,%xmm1
89d44: 66 0f f8 c8 psubb %xmm0,%xmm1
89d48: 66 0f d7 d1 pmovmskb %xmm1,%edx
89d4c: 81 ea ff ff 00 00 sub $0xffff,%edx
...
The real thing is 1183 lines of assembly, with lots of potential cleverness about detecting system features and vectorized instructions. libc maintainers know that they can get an edge by just optimizing some of the functions called thousands of times by applications.
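Incidentally, the first excerpt above (the tests followed by lea-then-ret) looks like the runtime resolver choosing an implementation based on CPU features. Application code can do the same kind of one-time dispatch with GCC's CPU-detection builtins; a sketch, where the two implementations are hypothetical names you would supply:
// One-time dispatch on CPU features, mirroring an IFUNC resolver.
int strcmp_generic(const char* a, const char* b);
int strcmp_sse42(const char* a, const char* b);
using strcmp_fn = int (*)(const char*, const char*);
static strcmp_fn resolve_strcmp() {
    __builtin_cpu_init(); // populate the feature flags
    if (__builtin_cpu_supports("sse4.2"))
        return strcmp_sse42;
    return strcmp_generic;
}
// Resolved once at startup; each use afterwards is one indirect call.
static const strcmp_fn best_strcmp = resolve_strcmp();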
For comparison, your version at -O3:
comparisons.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z13strcmp_customPKcS0_>:
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
0: 8a 0e mov (%rsi),%cl
2: 8a 07 mov (%rdi),%al
4: 38 c1 cmp %al,%cl
6: 75 1e jne 26 <_Z13strcmp_customPKcS0_+0x26>
if (*a == '\0') return 0;
8: 48 ff c6 inc %rsi
b: 48 ff c7 inc %rdi
e: 66 90 xchg %ax,%ax
10: 31 c0 xor %eax,%eax
12: 84 c9 test %cl,%cl
14: 74 18 je 2e <_Z13strcmp_customPKcS0_+0x2e>
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
16: 0f b6 0e movzbl (%rsi),%ecx
19: 0f b6 07 movzbl (%rdi),%eax
1c: 48 ff c6 inc %rsi
1f: 48 ff c7 inc %rdi
22: 38 c1 cmp %al,%cl
24: 74 ea je 10 <_Z13strcmp_customPKcS0_+0x10>
26: 0f be d0 movsbl %al,%edx
29: 0f be c1 movsbl %cl,%eax
if (*a == '\0') return 0;
a++;
b++;
}
return *b - *a;
2c: 29 d0 sub %edx,%eax
}
2e: c3 retq

Member initializer list, pointer initialization without argument

In a large framework which used to use many smart pointers and now uses raw pointers, I come across situations like this quite often:
class A {
public:
    int* m;
    A() : m() {}
};
The reason is that int* m used to be a smart pointer, so the initializer list called a default constructor. Now that int* m is a raw pointer, I am not certain whether this is equivalent to:
class A {
public:
    int* m;
    A() : m(nullptr) {}
};
Without the explicit nullptr, is A::m still initialized to zero? A look at an unoptimized objdump -d makes it appear so, but I am not certain. The reason I feel that the answer is yes is this line in the objdump -d output (I posted more of it below):
400644: 48 c7 00 00 00 00 00 movq $0x0,(%rax)
Here is a little program that tries to provoke undefined behavior:
class A {
public:
    int* m;
    A() : m() {}
};
int main() {
    A buf[1000000];
    unsigned int count = 0;
    for (unsigned int i = 0; i < 1000000; ++i) {
        count += buf[i].m ? 1 : 0;
    }
    return count;
}
Compilation, execution, and return value:
g++ -std=c++14 -O0 foo.cpp
./a.out; echo $?
0
Relevant assembly sections from objdump -d:
00000000004005b8 <main>:
4005b8: 55 push %rbp
4005b9: 48 89 e5 mov %rsp,%rbp
4005bc: 41 54 push %r12
4005be: 53 push %rbx
4005bf: 48 81 ec 10 12 7a 00 sub $0x7a1210,%rsp
4005c6: 48 8d 85 e0 ed 85 ff lea -0x7a1220(%rbp),%rax
4005cd: bb 3f 42 0f 00 mov $0xf423f,%ebx
4005d2: 49 89 c4 mov %rax,%r12
4005d5: eb 10 jmp 4005e7 <main+0x2f>
4005d7: 4c 89 e7 mov %r12,%rdi
4005da: e8 59 00 00 00 callq 400638 <_ZN1AC1Ev>
4005df: 49 83 c4 08 add $0x8,%r12
4005e3: 48 83 eb 01 sub $0x1,%rbx
4005e7: 48 83 fb ff cmp $0xffffffffffffffff,%rbx
4005eb: 75 ea jne 4005d7 <main+0x1f>
4005ed: c7 45 ec 00 00 00 00 movl $0x0,-0x14(%rbp)
4005f4: c7 45 e8 00 00 00 00 movl $0x0,-0x18(%rbp)
4005fb: eb 23 jmp 400620 <main+0x68>
4005fd: 8b 45 e8 mov -0x18(%rbp),%eax
400600: 48 8b 84 c5 e0 ed 85 mov -0x7a1220(%rbp,%rax,8),%rax
400607: ff
400608: 48 85 c0 test %rax,%rax
40060b: 74 07 je 400614 <main+0x5c>
40060d: b8 01 00 00 00 mov $0x1,%eax
400612: eb 05 jmp 400619 <main+0x61>
400614: b8 00 00 00 00 mov $0x0,%eax
400619: 01 45 ec add %eax,-0x14(%rbp)
40061c: 83 45 e8 01 addl $0x1,-0x18(%rbp)
400620: 81 7d e8 3f 42 0f 00 cmpl $0xf423f,-0x18(%rbp)
400627: 76 d4 jbe 4005fd <main+0x45>
400629: 8b 45 ec mov -0x14(%rbp),%eax
40062c: 48 81 c4 10 12 7a 00 add $0x7a1210,%rsp
400633: 5b pop %rbx
400634: 41 5c pop %r12
400636: 5d pop %rbp
400637: c3 retq
0000000000400638 <_ZN1AC1Ev>:
400638: 55 push %rbp
400639: 48 89 e5 mov %rsp,%rbp
40063c: 48 89 7d f8 mov %rdi,-0x8(%rbp)
400640: 48 8b 45 f8 mov -0x8(%rbp),%rax
400644: 48 c7 00 00 00 00 00 movq $0x0,(%rax)
40064b: 5d pop %rbp
40064c: c3 retq
40064d: 0f 1f 00 nopl (%rax)
An empty () initializer stands for default-initialization in C++98 and for value-initialization in C++03 and later. For scalar types (including pointers), value-initialization/default-initialization leads to zero-initialization.
This means that in your case m() and m(nullptr) will have exactly the same effect: in both cases m is initialized as a null pointer. In C++ it has been like that since the beginning of standardized times.
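A small sketch of the distinction (the type names are made up): an empty () in the member initializer list guarantees a null pointer, while omitting the member from the list entirely leaves it indeterminate for automatic-storage objects.
#include <cassert>
struct WithEmptyInit {
    int* m;
    WithEmptyInit() : m() {} // value-initialization: m is nullptr
};
struct WithNoInit {
    int* m;
    WithNoInit() {} // default-initialization: m is indeterminate
                    // for objects with automatic storage
};
int main() {
    WithEmptyInit a; // even on the stack, m is guaranteed null
    assert(a.m == nullptr);
    WithNoInit b; // b.m holds an indeterminate value here;
    (void)b;      // reading it would be undefined behavior
}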

assembly code for C++ scoped static initialization

I read some old articles about the initialization order problem for local-scope static variables:
C++ scoped static initialization is not thread-safe, back in 2004, and
Function Static Variables in Multi-Threaded Environments, from 2006.
Then I started to produce an example to check my compiler, gcc 4.4.7:
int calcSomething() { return 0; }
void foo() {
    static int x = calcSomething();
}
int main() {
    foo();
    return 0;
}
The result from objdump shows:
000000000040061a <_Z3foov>:
40061a: 55 push %rbp
40061b: 48 89 e5 mov %rsp,%rbp
40061e: b8 d0 0a 60 00 mov $0x600ad0,%eax
400623: 0f b6 00 movzbl (%rax),%eax
400626: 84 c0 test %al,%al
400628: 75 28 jne 400652 <_Z3foov+0x38>
40062a: bf d0 0a 60 00 mov $0x600ad0,%edi
40062f: e8 bc fe ff ff callq 4004f0 <__cxa_guard_acquire#plt>
400634: 85 c0 test %eax,%eax
400636: 0f 95 c0 setne %al
400639: 84 c0 test %al,%al
40063b: 74 15 je 400652 <_Z3foov+0x38>
40063d: e8 d2 ff ff ff callq 400614 <_Z13calcSomethingv>
400642: 89 05 90 04 20 00 mov %eax,0x200490(%rip) # 600ad8 <_ZZ3foovE1x>
400648: bf d0 0a 60 00 mov $0x600ad0,%edi
40064d: e8 be fe ff ff callq 400510 <__cxa_guard_release#plt>
400652: c9 leaveq
400653: c3 retq
Unfortunately, my knowledge of assembly code is so limited that I cannot tell what the compiler is doing here. Can anyone shed some light on what this assembly code does, and whether it is still not thread-safe? I'd really appreciate some "pseudo code" showing what gcc is doing here.
EDIT-1:
As Jerry commented, I enabled optimization with -O2; the assembly code is:
0000000000400620 <_Z3foov>:
400620: 48 83 ec 08 sub $0x8,%rsp
400624: 80 3d 85 04 20 00 00 cmpb $0x0,0x200485(%rip) # 600ab0 <_ZGVZ3foovE1x>
40062b: 74 0b je 400638 <_Z3foov+0x18>
40062d: 48 83 c4 08 add $0x8,%rsp
400631: c3 retq
400632: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400638: bf b0 0a 60 00 mov $0x600ab0,%edi
40063d: e8 9e fe ff ff callq 4004e0 <__cxa_guard_acquire#plt>
400642: 85 c0 test %eax,%eax
400644: 74 e7 je 40062d <_Z3foov+0xd>
400646: c7 05 68 04 20 00 00 movl $0x0,0x200468(%rip) # 600ab8 <_ZZ3foovE1x>
40064d: 00 00 00
400650: bf b0 0a 60 00 mov $0x600ab0,%edi
400655: 48 83 c4 08 add $0x8,%rsp
400659: e9 a2 fe ff ff jmpq 400500 <__cxa_guard_release#plt>
40065e: 66 90 xchg %ax,%ax
Yes, it is thread-safe. In pseudocode (for the un-optimized case) it's something like:
if (flag_val() != 0) goto done;
if (guard_acquire() == 0) goto done;
x = calcSomething();
guard_release_and_set_flag();
// Note releasing the guard lock causes later
// calls to flag_val() to return non-zero.
done: return
The flag_val() check is really a non-blocking fast path, apparently for efficiency, to avoid calling the acquire primitive unless necessary. The flag must be set by guard_release, as shown. The acquire appears to be the synchronized call that grabs the lock: only one thread will get a non-zero value back and perform the initialization. After it releases the lock, the non-zero flag prevents any further touches of the lock.
Another interesting tidbit is that the guard data structure is 8 bytes away from the value of x itself in static memory.
Those familiar with the singleton pattern in languages with built-in threading, e.g. Java, will recognize this!
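For comparison, here is a rough C++11 analogue of what the __cxa_guard_* calls achieve: double-checked locking around one-time initialization. This is a sketch of the idea, not the actual libstdc++ ABI:
#include <atomic>
#include <mutex>
int calcSomething();
static std::atomic<bool> x_initialized{false}; // plays the role of the guard byte
static std::mutex x_guard;                     // stands in for __cxa_guard_acquire/release
static int x;
void foo() {
    // Fast path: if already initialized, skip the lock entirely.
    if (!x_initialized.load(std::memory_order_acquire)) {
        std::lock_guard<std::mutex> lock(x_guard);
        // Re-check under the lock: another thread may have won the race.
        if (!x_initialized.load(std::memory_order_relaxed)) {
            x = calcSomething();
            x_initialized.store(true, std::memory_order_release);
        }
    }
}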
Addition
A bit more time now, so in a bit more detail:
000000000040061a <_Z3foov>:
; Prepare to access stack variables (never used in un-optimized code).
40061a: 55 push %rbp
40061b: 48 89 e5 mov %rsp,%rbp
; Test a byte 8 away from the static int x. This is apparently an "initialized" flag.
40061e: b8 d0 0a 60 00 mov $0x600ad0,%eax
400623: 0f b6 00 movzbl (%rax),%eax
400626: 84 c0 test %al,%al
; Goto the end of the function if the byte was non-zero.
400628: 75 28 jne 400652 <_Z3foov+0x38>
; Load the same byte address in di: the argument for the call to
; acquire the guard lock.
40062a: bf d0 0a 60 00 mov $0x600ad0,%edi
40062f: e8 bc fe ff ff callq 4004f0 <__cxa_guard_acquire#plt>
; Test the return value. Goto the end of the function if it was zero (un-optimized code).
400634: 85 c0 test %eax,%eax
400636: 0f 95 c0 setne %al
400639: 84 c0 test %al,%al
40063b: 74 15 je 400652 <_Z3foov+0x38>
; Call the user's initialization function and move result into x.
40063d: e8 d2 ff ff ff callq 400614 <_Z13calcSomethingv>
400642: 89 05 90 04 20 00 mov %eax,0x200490(%rip) # 600ad8 <_ZZ3foovE1x>
; Load the guard byte's address again and call the release routine.
; This must set the flag to non-zero.
400648: bf d0 0a 60 00 mov $0x600ad0,%edi
40064d: e8 be fe ff ff callq 400510 <__cxa_guard_release#plt>
; Restore state and return.
400652: c9 leaveq
400653: c3 retq
This listing, although for the LLVM compiler rather than g++ (are you running OS X? OS X aliases g++ to LLVM), agrees with the guesswork above. The set_initialized routine is setting a flag value in guard_release.