Why is this version of strcmp slower? - c++

I have been experimenting with improving the performance of strcmp under certain conditions. However, I unfortunately cannot even get an implementation of plain vanilla strcmp to perform as well as the library implementation.
I saw a similar question, but the answers say the difference was from the compiler optimizing away the comparison on string literals. My test does not use string literals.
Here's the implementation (comparisons.cpp):
int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
        if (*a == '\0') return 0;
        a++;
        b++;
    }
    return *b - *a;
}
And here's the test driver (driver.cpp):
#include "comparisons.h"
#include <array>
#include <chrono>
#include <cstdlib>   // srand, rand
#include <cstring>   // strcmp
#include <iostream>
void init_string(char* str, int nChars) {
    // 10% of strings will be equal, and 90% of strings will have one char different.
    // This way, many strings will share long prefixes so strcmp has to exercise a bit.
    // Using random strings still shows the custom implementation as slower (just less so).
    str[nChars - 1] = '\0';
    for (int i = 0; i < nChars - 1; i++)
        str[i] = (i % 94) + 32;
    if (rand() % 10 != 0)
        str[rand() % (nChars - 1)] = 'x';
}
int main(int argc, char** argv) {
    srand(1234);

    // Pre-generate some strings to compare.
    const int kSampleSize = 100;
    std::array<char[1024], kSampleSize> strings;
    for (int i = 0; i < kSampleSize; i++)
        init_string(strings[i], kSampleSize);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp(strings[i], strings[j]);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp - " << (end - start).count() << std::endl;

    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp_custom(strings[i], strings[j]);
    end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp_custom - " << (end - start).count() << std::endl;
}
And my makefile:
CC=clang++

test: driver.o comparisons.o
	$(CC) -o test driver.o comparisons.o

# Compile the test driver with optimizations off.
driver.o: driver.cpp comparisons.h
	$(CC) -c -o driver.o -std=c++11 -O0 driver.cpp

# Compile the code being tested separately with optimizations on.
comparisons.o: comparisons.cpp comparisons.h
	$(CC) -c -o comparisons.o -std=c++11 -O3 comparisons.cpp

clean:
	rm comparisons.o driver.o test
On the advice of this answer, I compiled my comparison function in a separate compilation unit with optimizations and compiled the driver with optimizations turned off, but I still get a slowdown of about 5x.
strcmp - 154519
strcmp_custom - 506282
I also tried copying the FreeBSD implementation but got similar results.
I'm wondering if my performance measurement is overlooking something. Or is the standard library implementation doing something fancier?

I don't know which standard library you have, but just to give you an idea of how serious C library maintainers are about optimizing the string primitives, the default strcmp used by GNU libc on x86-64 is two thousand lines of hand-optimized assembly language, as of version 2.24. There are separate, also hand-optimized, versions for when the SSSE3 and SSE4.2 instruction set extensions are available. (A fair bit of the complexity in that file appears to be because the same source code is used to generate several other functions; the machine code winds up being "only" 1120 instructions.) 2.24 was released roughly a year ago, and even more work has gone into it since.
They go to this much trouble because it's common for one of the string primitives to be the single hottest function in a profile.
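To give a flavor of what those hand-written versions do, here is a minimal sketch of the 16-bytes-at-a-time idea using SSE2 intrinsics. It is not the glibc code: a real implementation also has to check alignment and page boundaries so the 16-byte loads can never fault past the end of a string, which this sketch simply assumes is safe.
#include <emmintrin.h>  // SSE2 intrinsics

// Sketch only: assumes reading 16 bytes at a time from both strings is safe.
int strcmp_sse2_sketch(const char* a, const char* b) {
    for (;;) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
        // Bits set where the 16 byte pairs differ, and where `a` has a NUL.
        int neq = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) ^ 0xFFFF;
        int nul = _mm_movemask_epi8(_mm_cmpeq_epi8(va, _mm_setzero_si128()));
        if (int stop = neq | nul) {
            int i = __builtin_ctz(stop);  // first differing or terminating byte
            return (unsigned char)a[i] - (unsigned char)b[i];
        }
        a += 16;
        b += 16;
    }
}
Even this naive version examines 16 characters per iteration instead of one, which is where most of the advantage over a byte-at-a-time loop comes from.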

Excerpts from my disassembly of glibc (symbols versioned GLIBC_2.2.5), x86_64 Linux:
0000000000089cd0 <strcmp@@GLIBC_2.2.5>:
89cd0: 48 8b 15 99 a1 33 00 mov 0x33a199(%rip),%rdx # 3c3e70 <_IO_file_jumps@@GLIBC_2.2.5+0x790>
89cd7: 48 8d 05 92 58 01 00 lea 0x15892(%rip),%rax # 9f570 <strerror_l@@GLIBC_2.6+0x200>
89cde: f7 82 b0 00 00 00 10 testl $0x10,0xb0(%rdx)
89ce5: 00 00 00
89ce8: 75 1a jne 89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cea: 48 8d 05 9f 48 0c 00 lea 0xc489f(%rip),%rax # 14e590 <__nss_passwd_lookup@@GLIBC_2.2.5+0x9c30>
89cf1: f7 82 80 00 00 00 00 testl $0x200,0x80(%rdx)
89cf8: 02 00 00
89cfb: 75 07 jne 89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cfd: 48 8d 05 0c 00 00 00 lea 0xc(%rip),%rax # 89d10 <strcmp@@GLIBC_2.2.5+0x40>
89d04: c3 retq
89d05: 90 nop
89d06: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
89d0d: 00 00 00
89d10: 89 f1 mov %esi,%ecx
89d12: 89 f8 mov %edi,%eax
89d14: 48 83 e1 3f and $0x3f,%rcx
89d18: 48 83 e0 3f and $0x3f,%rax
89d1c: 83 f9 30 cmp $0x30,%ecx
89d1f: 77 3f ja 89d60 <strcmp@@GLIBC_2.2.5+0x90>
89d21: 83 f8 30 cmp $0x30,%eax
89d24: 77 3a ja 89d60 <strcmp@@GLIBC_2.2.5+0x90>
89d26: 66 0f 12 0f movlpd (%rdi),%xmm1
89d2a: 66 0f 12 16 movlpd (%rsi),%xmm2
89d2e: 66 0f 16 4f 08 movhpd 0x8(%rdi),%xmm1
89d33: 66 0f 16 56 08 movhpd 0x8(%rsi),%xmm2
89d38: 66 0f ef c0 pxor %xmm0,%xmm0
89d3c: 66 0f 74 c1 pcmpeqb %xmm1,%xmm0
89d40: 66 0f 74 ca pcmpeqb %xmm2,%xmm1
89d44: 66 0f f8 c8 psubb %xmm0,%xmm1
89d48: 66 0f d7 d1 pmovmskb %xmm1,%edx
89d4c: 81 ea ff ff 00 00 sub $0xffff,%edx
...
The real thing is 1183 lines of assembly, with a lot of cleverness around detecting CPU features and using vectorized instructions. libc maintainers know they can get an edge by optimizing the handful of functions that applications call thousands of times.
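The first block of the disassembly, up to the retq at 89d04, appears to be a resolver: it tests feature bits and returns the address of one of several candidate implementations. In C++ the same idea looks roughly like the sketch below (the strcmp_sse2/strcmp_sse42 names and bodies are made up for illustration; a real library would have genuinely different, vectorized bodies).
// Stand-in implementations, selected at runtime based on CPU features.
int strcmp_sse2(const char* a, const char* b)  { while (*a && *a == *b) { ++a; ++b; } return (unsigned char)*a - (unsigned char)*b; }
int strcmp_sse42(const char* a, const char* b) { return strcmp_sse2(a, b); }

using strcmp_fn = int (*)(const char*, const char*);

int strcmp_dispatched(const char* a, const char* b) {
    // Pick an implementation once, the way the resolver block above does.
    static const strcmp_fn impl =
        __builtin_cpu_supports("sse4.2") ? strcmp_sse42 : strcmp_sse2;
    return impl(a, b);
}
glibc does this selection through the dynamic linker's IFUNC mechanism rather than a function-local static, but the effect is the same: the feature check runs once, and every later call goes straight to the chosen implementation.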
For comparison, your version at -O3:
comparisons.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z13strcmp_customPKcS0_>:
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
0: 8a 0e mov (%rsi),%cl
2: 8a 07 mov (%rdi),%al
4: 38 c1 cmp %al,%cl
6: 75 1e jne 26 <_Z13strcmp_customPKcS0_+0x26>
if (*a == '\0') return 0;
8: 48 ff c6 inc %rsi
b: 48 ff c7 inc %rdi
e: 66 90 xchg %ax,%ax
10: 31 c0 xor %eax,%eax
12: 84 c9 test %cl,%cl
14: 74 18 je 2e <_Z13strcmp_customPKcS0_+0x2e>
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
16: 0f b6 0e movzbl (%rsi),%ecx
19: 0f b6 07 movzbl (%rdi),%eax
1c: 48 ff c6 inc %rsi
1f: 48 ff c7 inc %rdi
22: 38 c1 cmp %al,%cl
24: 74 ea je 10 <_Z13strcmp_customPKcS0_+0x10>
26: 0f be d0 movsbl %al,%edx
29: 0f be c1 movsbl %cl,%eax
if (*a == '\0') return 0;
a++;
b++;
}
return *b - *a;
2c: 29 d0 sub %edx,%eax
}
2e: c3 retq

Related

Constructor call of function local static object skipped

I have written the following function:
inline void putc(int c)
{
    static Cons_serial serial;
    if (serial.enabled())
        serial.putc(c);
}
Where Cons_serial is a class with a non-trivial default constructor. I believe the exact class definition is not important here, but you can correct me on that. I'm compiling for x86 32-bit with g++ using the following flags: -m32 -fno-PIC -ffreestanding -fno-rtti -fno-exceptions -fno-threadsafe-statics -O0. The generated assembly code for putc looks like this:
00100221 <_Z4putci>:
100221: 55 push %ebp
100222: 89 e5 mov %esp,%ebp
100224: 83 ec 08 sub $0x8,%esp
100227: b8 f8 02 10 00 mov $0x1002f8,%eax
10022c: 0f b6 00 movzbl (%eax),%eax
10022f: 84 c0 test %al,%al
100231: 75 18 jne 10024b <_Z4putci+0x2a>
100233: 83 ec 0c sub $0xc,%esp
100236: 68 f0 02 10 00 push $0x1002f0
10023b: e8 36 fe ff ff call 100076 <_ZN11Cons_serialC1Ev>
100240: 83 c4 10 add $0x10,%esp
100243: b8 f8 02 10 00 mov $0x1002f8,%eax
100248: c6 00 01 movb $0x1,(%eax)
10024b: 83 ec 0c sub $0xc,%esp
10024e: 68 f0 02 10 00 push $0x1002f0
100253: e8 4c fe ff ff call 1000a4 <_ZNK11Cons_serial7enabledEv>
100258: 83 c4 10 add $0x10,%esp
10025b: 84 c0 test %al,%al
10025d: 74 13 je 100272 <_Z4putci+0x51>
10025f: 83 ec 08 sub $0x8,%esp
100262: ff 75 08 pushl 0x8(%ebp)
100265: 68 f0 02 10 00 push $0x1002f0
10026a: e8 41 fe ff ff call 1000b0 <_ZN11Cons_serial4putcEi>
10026f: 83 c4 10 add $0x10,%esp
100272: 90 nop
100273: c9 leave
100274: c3 ret
During execution, the jump at 100231 is taken the first time the function runs, thus the Cons_serial constructor is never called. My knowledge of x86 assembly is questionable, so what do the instructions leading up to that one actually do? I assume the code is meant to skip the constructor call on subsequent function calls. But then why is it skipped the first time the function runs as well?
EDIT: This code is part of a kernel I'm writing and I suspect the root cause might be an issue with my kernel's .bss section, here is the linker script I use:
OUTPUT_FORMAT("elf32-i386")
ENTRY(_start)
SECTIONS
{
    . = 0x100000;
    .text : AT(0x100000) {
        *(.text)
    }
    .data : SUBALIGN(2) {
        *(.data);
        *(.rodata*);
    }
    .bss : SUBALIGN(4) {
        __bss_start = .;
        *(.COMMON);
        *(.bss*)
        . = ALIGN(4);
        __bss_end = .;
    }
    /DISCARD/ : {
        *(.eh_frame)
        *(.comment)
    }
}
And here's the code I use to zero the .bss section:
extern uint32_t __bss_start;
extern uint32_t __bss_end;

void zero_bss()
{
    for (uint32_t bss_addr = __bss_start; bss_addr < __bss_end; ++bss_addr)
        *reinterpret_cast<uint8_t *>(bss_addr) = 0x00;
}
But when zero_bss runs, __bss_start is 0x27 and __bss_end is 0x101, which is not at all what I'd expect (the BSS should encompass address 0x1002f8, after all).
I've solved it now; the hint from @user3124812 was what got me there, thanks again.
My zero_bss code was faulty: I needed to take the addresses of the __bss* markers from the linker script, i.e.:
extern uint8_t __bss_start;
extern uint8_t __bss_end;

void zero_bss()
{
    uint8_t *bss_start = reinterpret_cast<uint8_t *>(&__bss_start);
    uint8_t *bss_end = reinterpret_cast<uint8_t *>(&__bss_end);

    for (uint8_t *bss_addr = bss_start; bss_addr < bss_end; ++bss_addr)
        *bss_addr = 0x00;
}
Now everything works.
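For reference, what g++ emits for the function-local static is roughly equivalent to the sketch below (names invented, class body is a stand-in; with -fno-threadsafe-statics there is no locking, just a one-byte guard flag). Since the guard flag lives in .bss, a .bss that is not actually zeroed can start out non-zero, which is exactly why the constructor call was skipped.
#include <new>

struct Cons_serial {                       // stand-in; the real class is defined elsewhere
    Cons_serial() {}
    bool enabled() const { return true; }
    void putc(int) {}
};

alignas(Cons_serial) static unsigned char serial_storage[sizeof(Cons_serial)]; // the object at 0x1002f0
static bool serial_guard;                                                       // the flag at 0x1002f8

inline void putc_lowered(int c) {
    if (!serial_guard) {                        // test/jne at 10022c..100231
        new (serial_storage) Cons_serial();     // constructor call at 10023b
        serial_guard = true;                    // movb $0x1,(%eax) at 100248
    }
    Cons_serial& serial = *reinterpret_cast<Cons_serial*>(serial_storage);
    if (serial.enabled())
        serial.putc(c);
}
If serial_guard contains garbage instead of zero at startup, the if is never entered and the constructor never runs, which matches the behaviour observed before zero_bss was fixed.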

Which is better: returning tuple or passing arguments to function as references?

I created code with two functions, returnValues and returnValuesVoid. One returns a tuple of two values, and the other takes its arguments by reference.
#include <iostream>
#include <tuple>

std::tuple<int, int> returnValues(const int a, const int b) {
    return std::tuple(a, b);
}

void returnValuesVoid(int &a, int &b) {
    a += 100;
    b += 100;
}

int main() {
    auto [x, y] = returnValues(10, 20);
    std::cout << x;
    std::cout << y;

    int a = 10, b = 20;
    returnValuesVoid(a, b);
    std::cout << a;
    std::cout << b;
}
I read about http://en.cppreference.com/w/cpp/language/structured_binding
which can destructure a tuple into auto [x,y] variables.
Is auto [x,y] = returnValues(10,20); better than passing by reference? As far as I know it's slower, because it has to return a tuple object, while the reference version just works on the original variables passed to the function, so there's no reason to use it except cleaner code.
Since auto [x,y] is C++17, do people use it in production? It looks cleaner than returnValuesVoid, which returns void, but does it have other advantages over passing by reference?
Look at the disassembly (compiled with GCC -O3):
It takes more instructions to implement the tuple version.
0000000000000000 <returnValues(int, int)>:
0: 83 c2 64 add $0x64,%edx
3: 83 c6 64 add $0x64,%esi
6: 48 89 f8 mov %rdi,%rax
9: 89 17 mov %edx,(%rdi)
b: 89 77 04 mov %esi,0x4(%rdi)
e: c3 retq
f: 90 nop
0000000000000010 <returnValuesVoid(int&, int&)>:
10: 83 07 64 addl $0x64,(%rdi)
13: 83 06 64 addl $0x64,(%rsi)
16: c3 retq
But fewer instructions for the tuple caller:
0000000000000000 <callTuple()>:
0: 48 83 ec 18 sub $0x18,%rsp
4: ba 14 00 00 00 mov $0x14,%edx
9: be 0a 00 00 00 mov $0xa,%esi
e: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
13: e8 00 00 00 00 callq 18 <callTuple()+0x18> // call returnValues
18: 8b 74 24 0c mov 0xc(%rsp),%esi
1c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
23: e8 00 00 00 00 callq 28 <callTuple()+0x28> // std::cout::operator<<
28: 8b 74 24 08 mov 0x8(%rsp),%esi
2c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
33: e8 00 00 00 00 callq 38 <callTuple()+0x38> // std::cout::operator<<
38: 48 83 c4 18 add $0x18,%rsp
3c: c3 retq
3d: 0f 1f 00 nopl (%rax)
0000000000000040 <callRef()>:
40: 48 83 ec 18 sub $0x18,%rsp
44: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
49: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
4e: c7 44 24 08 0a 00 00 movl $0xa,0x8(%rsp)
55: 00
56: c7 44 24 0c 14 00 00 movl $0x14,0xc(%rsp)
5d: 00
5e: e8 00 00 00 00 callq 63 <callRef()+0x23> // call returnValuesVoid
63: 8b 74 24 08 mov 0x8(%rsp),%esi
67: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
6e: e8 00 00 00 00 callq 73 <callRef()+0x33> // std::cout::operator<<
73: 8b 74 24 0c mov 0xc(%rsp),%esi
77: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7e: e8 00 00 00 00 callq 83 <callRef()+0x43> // std::cout::operator<<
83: 48 83 c4 18 add $0x18,%rsp
87: c3 retq
I don't think there is any considerable performance difference, but the tuple version is clearer and more readable.
I also tried the inlined call; there is absolutely no difference at all. Both of them generate exactly the same assembly code.
0000000000000000 <callTuple()>:
0: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7: 48 83 ec 08 sub $0x8,%rsp
b: be 6e 00 00 00 mov $0x6e,%esi
10: e8 00 00 00 00 callq 15 <callTuple()+0x15>
15: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
1c: be 78 00 00 00 mov $0x78,%esi
21: 48 83 c4 08 add $0x8,%rsp
25: e9 00 00 00 00 jmpq 2a <callTuple()+0x2a> // TCO, optimized way to call a function and also return
2a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000030 <callRef()>:
30: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
37: 48 83 ec 08 sub $0x8,%rsp
3b: be 6e 00 00 00 mov $0x6e,%esi
40: e8 00 00 00 00 callq 45 <callRef()+0x15>
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
4c: be 78 00 00 00 mov $0x78,%esi
51: 48 83 c4 08 add $0x8,%rsp
55: e9 00 00 00 00 jmpq 5a <callRef()+0x2a> // TCO, optimized way to call a function and also return
Focus on what's more readable and which approach gives the reader better intuition, and keep the performance issues you think might arise in the background.
A function that returns a tuple (or a pair, a struct, etc.) is telling the caller plainly that it returns something, and that something almost always has a meaning the caller should take into account.
A function that gives back its results through reference parameters may escape the attention of a tired reader.
So, in general, prefer to return the results in a tuple.
Mike van Dyke pointed to this link:
F.21: To return multiple "out" values, prefer returning a tuple or struct
Reason
A return value is self-documenting as an "output-only"
value. Note that C++ does have multiple return values, by convention
of using a tuple (including pair), possibly with the extra convenience
of tie at the call site.
[...]
Exception
Sometimes, we need to pass an object to a function to manipulate its state. In such cases, passing the object by reference T& is usually the right technique.
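Along the lines of F.21, a small named struct is often even clearer than a tuple, because the results keep their names at the call site. A minimal sketch (the struct and function names here are just illustrative):
struct Values { int a; int b; };

Values returnValuesStruct(int a, int b) {
    return {a, b};
}

// At the call site, structured bindings still work, or the fields can be used by name:
//   auto [x, y] = returnValuesStruct(10, 20);
//   Values v    = returnValuesStruct(10, 20);   // v.a, v.b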
Using another compiler (VS 2017) the resulting code shows no difference, as the function calls are just optimized away.
int main() {
00007FF6A9C51E50 sub rsp,28h
auto [x,y] = returnValues(10,20);
std::cout << x ;
00007FF6A9C51E54 mov edx,0Ah
00007FF6A9C51E59 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << y ;
00007FF6A9C51E5E mov edx,14h
00007FF6A9C51E63 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
int a = 10, b = 20;
returnValuesVoid(a, b);
std::cout << a ;
00007FF6A9C51E68 mov edx,6Eh
00007FF6A9C51E6D call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << b ;
00007FF6A9C51E72 mov edx,78h
00007FF6A9C51E77 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
}
00007FF6A9C51E7C xor eax,eax
00007FF6A9C51E7E add rsp,28h
00007FF6A9C51E82 ret
So using clearer code seems to be the obvious choice.
What Zang said is true, but not quite the whole picture. I ran the code provided in the question with chrono to measure time, and I think the answer needs to be revised after observing what happened.
For 1M iterations, the time taken by the call by reference was 3 ms, while the time taken by the call via std::tie combined with std::tuple was about 94 ms.
Though the difference is small in practice, the tuple version will still perform slightly slower. Hence, for performance-intensive systems, I suggest passing by reference.
My code:
#include <iostream>
#include <tuple>
#include <chrono>

std::tuple<int, int> returnValues(const int a, const int b)
{
    return std::tuple<int, int>(a, b);
}

void returnValuesVoid(int &a, int &b)
{
    a += 100;
    b += 100;
}

int main()
{
    int a = 10, b = 20;

    auto begin = std::chrono::high_resolution_clock::now();
    int x, y;
    for (int i = 0; i < 1000000; i++)
    {
        std::tie(x, y) = returnValues(a, b);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()) << '\n';

    a = 10;
    b = 20;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000; i++)
    {
        returnValuesVoid(a, b);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()) << '\n';
}

C++ Performance: Two shorts vs int bit operations

Which of these options has the best performance:
using two shorts for two pieces of information, or using one int and bit operations to retrieve each half?
This may vary depending on architecture and compiler, but generally the version using an int plus bit operations will perform slightly worse. The performance difference is so minimal, though, that to this day I haven't written code that required that level of optimization; I depend on the compiler to do this kind of optimization for me.
Now let us check the C++ code below, which simulates the behaviour:
int main()
{
    int x = 100;
    short a = 255;
    short b = 127;

    short p = x >> 16;
    short q = x & 0xffff;

    short y = a;
    short z = b;

    return 0;
}
The corresponding assembly code on an x86_64 system (from GNU g++, compiled without optimization) is shown below:
00000000004004ed <main>:
int main()
{
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
int x = 100;
4004f1: c7 45 fc 64 00 00 00 movl $0x64,-0x4(%rbp)
short a = 255;
4004f8: 66 c7 45 f0 ff 00 movw $0xff,-0x10(%rbp)
short b = 127;
4004fe: 66 c7 45 f2 7f 00 movw $0x7f,-0xe(%rbp)
short p = x >> 16;
400504: 8b 45 fc mov -0x4(%rbp),%eax
400507: c1 f8 10 sar $0x10,%eax
40050a: 66 89 45 f4 mov %ax,-0xc(%rbp)
short q = x & 0xffff;
40050e: 8b 45 fc mov -0x4(%rbp),%eax
400511: 66 89 45 f6 mov %ax,-0xa(%rbp)
short y = a;
400515: 0f b7 45 f0 movzwl -0x10(%rbp),%eax
400519: 66 89 45 f8 mov %ax,-0x8(%rbp)
short z = b;
40051d: 0f b7 45 f2 movzwl -0xe(%rbp),%eax
400521: 66 89 45 fa mov %ax,-0x6(%rbp)
return 0;
400525: b8 00 00 00 00 mov $0x0,%eax
}
As we can see, "short p = x >> 16" is the slowest, as it needs an extra right-shift instruction, while all the other assignments are equal in terms of cost.
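For completeness, a typical pack/unpack helper pair for the two-shorts-in-one-int approach looks like the sketch below (names are mine; using unsigned types sidesteps the implementation-defined result of right-shifting a negative int that the signed version above could run into):
#include <cstdint>

inline uint32_t pack(uint16_t hi, uint16_t lo) {
    return (static_cast<uint32_t>(hi) << 16) | lo;
}

inline uint16_t unpack_hi(uint32_t v) { return static_cast<uint16_t>(v >> 16); }
inline uint16_t unpack_lo(uint32_t v) { return static_cast<uint16_t>(v & 0xFFFF); }

// e.g. pack(255, 127) stores both values in one 32-bit word; with optimization
// enabled each of these helpers typically compiles to one or two instructions.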

Why does vectorization fail?

I want to optimize my code for vectorization using
-msse2 -ftree-vectorizer-verbose=2.
I have the following simple code:
int main(){
    int a[2048], b[2048], c[2048];
    int i;

    for (i=0; i<2048; i++){
        b[i]=0;
        c[i]=0;
    }

    for (i=0; i<2048; i++){
        a[i] = b[i] + c[i];
    }

    return 0;
}
Why do I get the note
test.cpp:10: note: not vectorized: not enough data-refs in basic block.
Thanks!
For what it's worth, after adding an asm volatile("": "+m"(a), "+m"(b), "+m"(c)::"memory"); near the end of main, my copy of gcc emits this:
400610: 48 81 ec 08 60 00 00 sub $0x6008,%rsp
400617: ba 00 20 00 00 mov $0x2000,%edx
40061c: 31 f6 xor %esi,%esi
40061e: 48 8d bc 24 00 20 00 lea 0x2000(%rsp),%rdi
400625: 00
400626: e8 b5 ff ff ff callq 4005e0 <memset@plt>
40062b: ba 00 20 00 00 mov $0x2000,%edx
400630: 31 f6 xor %esi,%esi
400632: 48 8d bc 24 00 40 00 lea 0x4000(%rsp),%rdi
400639: 00
40063a: e8 a1 ff ff ff callq 4005e0 <memset@plt>
40063f: 31 c0 xor %eax,%eax
400641: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400648: c5 f9 6f 84 04 00 20 vmovdqa 0x2000(%rsp,%rax,1),%xmm0
40064f: 00 00
400651: c5 f9 fe 84 04 00 40 vpaddd 0x4000(%rsp,%rax,1),%xmm0,%xmm0
400658: 00 00
40065a: c5 f8 29 04 04 vmovaps %xmm0,(%rsp,%rax,1)
40065f: 48 83 c0 10 add $0x10,%rax
400663: 48 3d 00 20 00 00 cmp $0x2000,%rax
400669: 75 dd jne 400648 <main+0x38>
So it recognised that the first loop was just doing memset to a couple arrays and the second loop was doing a vector addition, which it appropriately vectorised.
I'm using gcc version 4.9.0 20140521 (prerelease) (GCC).
An older machine with gcc version 4.7.2 (Debian 4.7.2-5) also vectorises the loop, but in a different way. Your -ftree-vectorizer-verbose=2 setting makes it emit the following output:
Analyzing loop at foo155.cc:10
Vectorizing loop at foo155.cc:10
10: LOOP VECTORIZED.
foo155.cc:1: note: vectorized 1 loops in function.
You probably goofed your compiler flags (I used g++ -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 -march=native foo155.cc -o foo155 to build) or have a really old compiler.
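In case it helps, this is roughly where the escape statement mentioned above sits in the test program (a sketch of the modified main; the empty asm just tells the compiler that all three arrays are read and written by something it cannot see, so neither loop can be thrown away as dead code):
int main() {
    int a[2048], b[2048], c[2048];

    for (int i = 0; i < 2048; i++) {
        b[i] = 0;
        c[i] = 0;
    }

    for (int i = 0; i < 2048; i++)
        a[i] = b[i] + c[i];

    // Keep the arrays (and therefore both loops) alive for the optimizer.
    asm volatile("" : "+m"(a), "+m"(b), "+m"(c) : : "memory");

    return 0;
}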
Remove the first loop and do this instead:
int a[2048], b[2048] = {0}, c[2048] = {0};
Also try this flag
-ftree-vectorize
instead of
-msse2 -ftree-vectorizer-verbose=2

why is inlined function slower than function pointer?

Consider the following code:
typedef void (*Fn)();

volatile long sum = 0;

inline void accu() {
    sum += 4;
}

static const Fn map[4] = {&accu, &accu, &accu, &accu};

int main(int argc, char** argv) {
    static const long N = 10000000L;
    if (argc == 1)
    {
        for (long i = 0; i < N; i++)
        {
            accu();
            accu();
            accu();
            accu();
        }
    }
    else
    {
        for (long i = 0; i < N; i++)
        {
            for (int j = 0; j < 4; j++)
                (*map[j])();
        }
    }
}
When I compiled it with:
g++ -O3 test.cpp
I was expecting the first branch to run faster because the compiler could inline the calls to accu, while the second branch cannot be inlined because accu is called through a function pointer stored in an array.
But the results surprised me:
time ./a.out
real 0m0.108s
user 0m0.104s
sys 0m0.000s
time ./a.out 1
real 0m0.095s
user 0m0.088s
sys 0m0.004s
I don't understand why, so I did an objdump:
objdump -DStTrR a.out > a.s
and the disassembly doesn't seem to explain the performance result I got:
8048300 <main>:
8048300: 55 push %ebp
8048301: 89 e5 mov %esp,%ebp
8048303: 53 push %ebx
8048304: bb 80 96 98 00 mov $0x989680,%ebx
8048309: 83 e4 f0 and $0xfffffff0,%esp
804830c: 83 7d 08 01 cmpl $0x1,0x8(%ebp)
8048310: 74 27 je 8048339 <main+0x39>
8048312: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
8048318: e8 23 01 00 00 call 8048440 <_Z4accuv>
804831d: e8 1e 01 00 00 call 8048440 <_Z4accuv>
8048322: e8 19 01 00 00 call 8048440 <_Z4accuv>
8048327: e8 14 01 00 00 call 8048440 <_Z4accuv>
804832c: 83 eb 01 sub $0x1,%ebx
804832f: 90 nop
8048330: 75 e6 jne 8048318 <main+0x18>
8048332: 31 c0 xor %eax,%eax
8048334: 8b 5d fc mov -0x4(%ebp),%ebx
8048337: c9 leave
8048338: c3 ret
8048339: b8 80 96 98 00 mov $0x989680,%eax
804833e: 66 90 xchg %ax,%ax
8048340: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048346: 83 c2 04 add $0x4,%edx
8048349: 89 15 18 a0 04 08 mov %edx,0x804a018
804834f: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048355: 83 c2 04 add $0x4,%edx
8048358: 89 15 18 a0 04 08 mov %edx,0x804a018
804835e: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048364: 83 c2 04 add $0x4,%edx
8048367: 89 15 18 a0 04 08 mov %edx,0x804a018
804836d: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048373: 83 c2 04 add $0x4,%edx
8048376: 83 e8 01 sub $0x1,%eax
8048379: 89 15 18 a0 04 08 mov %edx,0x804a018
804837f: 75 bf jne 8048340 <main+0x40>
8048381: eb af jmp 8048332 <main+0x32>
8048383: 90 nop
...
8048440 <_Z4accuv>:
8048440: a1 18 a0 04 08 mov 0x804a018,%eax
8048445: 83 c0 04 add $0x4,%eax
8048448: a3 18 a0 04 08 mov %eax,0x804a018
804844d: c3 ret
804844e: 90 nop
804844f: 90 nop
It seems the direct call branch is definitely doing less than the function pointer branch.
But why does the function pointer branch run faster than the direct call?
And note that I didn't just use "time" for the measurement; I've also used clock_gettime and got similar results.
It is not completely true that the second branch cannot be inlined. In fact, all the function pointers stored in the array are seen at compile time, so the compiler can substitute direct calls for the indirect calls (and it does so). In theory it could go further and inline them, in which case we would have two identical branches, but this particular compiler is not smart enough to do so.
As a result, the first branch is optimized "better", with one exception: the compiler is not allowed to optimize away accesses to the volatile variable sum. As you can see from the disassembled code, this produces store instructions immediately followed by load instructions that depend on them:
mov %edx,0x804a018
mov 0x804a018,%edx
Intel's Software Optimization Manual (section 3.6.5.2) does not recommend arranging instructions like this:
... if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.
The second branch avoids this problem because of additional call/return instructions between store and load. So it performs better.
Similar improvements may be done for the first branch if we add some (not very expensive) calculations in-between:
long x1 = 0;
for (long i = 0; i < N; i++)
{
    x1 ^= i<<8;
    accu();
    x1 ^= i<<1;
    accu();
    x1 ^= i<<2;
    accu();
    x1 ^= i<<4;
    accu();
}
sum += x1;