C++ Performance: Two shorts vs int bit operations - c++

Which of these options has the best performance:
using two shorts to hold two separate pieces of information, or using one int and bit operations to retrieve each half?

This may vary with architecture and compiler, but generally the version using one int plus bit operations will perform slightly worse. The difference is so minimal, though, that I have yet to write code that needs that level of optimization; I leave this kind of optimization to the compiler.
Now let us look at the following C++ code, which simulates the behaviour:
int main()
{
    int x = 100;
    short a = 255;
    short b = 127;
    short p = x >> 16;
    short q = x & 0xffff;
    short y = a;
    short z = b;
    return 0;
}
The corresponding assembly code on an x86_64 system (from GNU g++) is shown below:
00000000004004ed <main>:
int main()
{
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
int x = 100;
4004f1: c7 45 fc 64 00 00 00 movl $0x64,-0x4(%rbp)
short a = 255;
4004f8: 66 c7 45 f0 ff 00 movw $0xff,-0x10(%rbp)
short b = 127;
4004fe: 66 c7 45 f2 7f 00 movw $0x7f,-0xe(%rbp)
short p = x >> 16;
400504: 8b 45 fc mov -0x4(%rbp),%eax
400507: c1 f8 10 sar $0x10,%eax
40050a: 66 89 45 f4 mov %ax,-0xc(%rbp)
short q = x & 0xffff;
40050e: 8b 45 fc mov -0x4(%rbp),%eax
400511: 66 89 45 f6 mov %ax,-0xa(%rbp)
short y = a;
400515: 0f b7 45 f0 movzwl -0x10(%rbp),%eax
400519: 66 89 45 f8 mov %ax,-0x8(%rbp)
short z = b;
40051d: 0f b7 45 f2 movzwl -0xe(%rbp),%eax
400521: 66 89 45 fa mov %ax,-0x6(%rbp)
return 0;
400525: b8 00 00 00 00 mov $0x0,%eax
}
As we can see, "short p = x >> 16" is the most expensive assignment, because it needs an extra arithmetic right shift (sar) instruction, while all the other assignments cost roughly the same.
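For reference, here is a minimal sketch of the two options from the question; the pack/upper/lower helper names are my own illustrative choices, not part of the original post:

#include <cstdint>
#include <iostream>

// Option 1: one 32-bit value plus bit operations.
std::uint32_t pack(std::uint16_t hi, std::uint16_t lo) {
    return (static_cast<std::uint32_t>(hi) << 16) | lo;
}
std::uint16_t upper(std::uint32_t v) { return static_cast<std::uint16_t>(v >> 16); }
std::uint16_t lower(std::uint32_t v) { return static_cast<std::uint16_t>(v & 0xffff); }

int main() {
    // Option 2 would simply keep the two values in two separate shorts.
    std::uint32_t packed = pack(255, 127);
    std::cout << upper(packed) << ' ' << lower(packed) << '\n';   // prints "255 127"
}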

Related

Which is better: returning a tuple or passing arguments to a function as references?

I created code with two functions, returnValues and returnValuesVoid. One returns a tuple of two values and the other takes its arguments by reference.
#include <iostream>
#include <tuple>
std::tuple<int, int> returnValues(const int a, const int b) {
    return std::tuple(a, b);
}

void returnValuesVoid(int &a, int &b) {
    a += 100;
    b += 100;
}

int main() {
    auto [x, y] = returnValues(10, 20);
    std::cout << x;
    std::cout << y;

    int a = 10, b = 20;
    returnValuesVoid(a, b);
    std::cout << a;
    std::cout << b;
}
I read about http://en.cppreference.com/w/cpp/language/structured_binding
which can destructure a tuple into auto [x,y] variables.
Is auto [x,y] = returnValues(10,20); better than passing by reference? As far as I know it is slower, because it has to return a tuple object, whereas references just work on the original variables passed to the function, so there seems to be no reason to use it except cleaner code.
Since auto [x,y] is a C++17 feature, do people use it in production? It certainly looks cleaner than returnValuesVoid, which returns void, but does it have other advantages over passing by reference?
Look at the disassembly (compiled with GCC -O3):
It takes more instructions to implement the tuple-returning function.
0000000000000000 <returnValues(int, int)>:
0: 83 c2 64 add $0x64,%edx
3: 83 c6 64 add $0x64,%esi
6: 48 89 f8 mov %rdi,%rax
9: 89 17 mov %edx,(%rdi)
b: 89 77 04 mov %esi,0x4(%rdi)
e: c3 retq
f: 90 nop
0000000000000010 <returnValuesVoid(int&, int&)>:
10: 83 07 64 addl $0x64,(%rdi)
13: 83 06 64 addl $0x64,(%rsi)
16: c3 retq
But fewer instructions on the tuple caller's side:
0000000000000000 <callTuple()>:
0: 48 83 ec 18 sub $0x18,%rsp
4: ba 14 00 00 00 mov $0x14,%edx
9: be 0a 00 00 00 mov $0xa,%esi
e: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
13: e8 00 00 00 00 callq 18 <callTuple()+0x18> // call returnValues
18: 8b 74 24 0c mov 0xc(%rsp),%esi
1c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
23: e8 00 00 00 00 callq 28 <callTuple()+0x28> // std::cout::operator<<
28: 8b 74 24 08 mov 0x8(%rsp),%esi
2c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
33: e8 00 00 00 00 callq 38 <callTuple()+0x38> // std::cout::operator<<
38: 48 83 c4 18 add $0x18,%rsp
3c: c3 retq
3d: 0f 1f 00 nopl (%rax)
0000000000000040 <callRef()>:
40: 48 83 ec 18 sub $0x18,%rsp
44: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
49: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
4e: c7 44 24 08 0a 00 00 movl $0xa,0x8(%rsp)
55: 00
56: c7 44 24 0c 14 00 00 movl $0x14,0xc(%rsp)
5d: 00
5e: e8 00 00 00 00 callq 63 <callRef()+0x23> // call returnValuesVoid
63: 8b 74 24 08 mov 0x8(%rsp),%esi
67: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
6e: e8 00 00 00 00 callq 73 <callRef()+0x33> // std::cout::operator<<
73: 8b 74 24 0c mov 0xc(%rsp),%esi
77: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7e: e8 00 00 00 00 callq 83 <callRef()+0x43> // std::cout::operator<<
83: 48 83 c4 18 add $0x18,%rsp
87: c3 retq
I don't think there is any considerable performance difference, but the tuple version is clearer and more readable.
I also tried inlining the calls: there is absolutely no difference at all. Both generate exactly the same assembly code.
0000000000000000 <callTuple()>:
0: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7: 48 83 ec 08 sub $0x8,%rsp
b: be 6e 00 00 00 mov $0x6e,%esi
10: e8 00 00 00 00 callq 15 <callTuple()+0x15>
15: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
1c: be 78 00 00 00 mov $0x78,%esi
21: 48 83 c4 08 add $0x8,%rsp
25: e9 00 00 00 00 jmpq 2a <callTuple()+0x2a> // TCO, optimized way to call a function and also return
2a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000030 <callRef()>:
30: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
37: 48 83 ec 08 sub $0x8,%rsp
3b: be 6e 00 00 00 mov $0x6e,%esi
40: e8 00 00 00 00 callq 45 <callRef()+0x15>
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
4c: be 78 00 00 00 mov $0x78,%esi
51: 48 83 c4 08 add $0x8,%rsp
55: e9 00 00 00 00 jmpq 5a <callRef()+0x2a> // TCO, optimized way to call a function and also return
Focus on what is more readable and which approach gives the reader better intuition, and keep whatever performance concerns you think might arise in the background.
A function that returns a tuple (or a pair, a struct, etc.) is telling its caller loudly that it returns something, and that something almost always has a meaning the user should take into account.
A function that hands its results back through variables passed by reference can easily slip past the attention of a tired reader.
So, in general, prefer to return the results in a tuple.
Mike van Dyke pointed to this link:
F.21: To return multiple "out" values, prefer returning a tuple or struct
Reason
A return value is self-documenting as an "output-only"
value. Note that C++ does have multiple return values, by convention
of using a tuple (including pair), possibly with the extra convenience
of tie at the call site.
[...]
Exception
Sometimes, we need to pass an object to a function to manipulate its state. In such cases, passing the object by reference T& is usually the right technique.
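To make the guideline concrete, here is a small sketch of the struct flavour of that advice; the MinMax type and the minmax function are illustrative, not taken from the question. Named members document the outputs even better than a bare tuple, and structured bindings keep the call site terse:

#include <iostream>

struct MinMax { int min; int max; };

MinMax minmax(int a, int b) {
    return (a < b) ? MinMax{a, b} : MinMax{b, a};
}

int main() {
    auto [lo, hi] = minmax(20, 10);         // C++17 structured bindings
    std::cout << lo << ' ' << hi << '\n';   // prints "10 20"
}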
Using another compiler (VS 2017) the resulting code shows no difference, as the function calls are just optimized away.
int main() {
00007FF6A9C51E50 sub rsp,28h
auto [x,y] = returnValues(10,20);
std::cout << x ;
00007FF6A9C51E54 mov edx,0Ah
00007FF6A9C51E59 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << y ;
00007FF6A9C51E5E mov edx,14h
00007FF6A9C51E63 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
int a = 10, b = 20;
returnValuesVoid(a, b);
std::cout << a ;
00007FF6A9C51E68 mov edx,6Eh
00007FF6A9C51E6D call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << b ;
00007FF6A9C51E72 mov edx,78h
00007FF6A9C51E77 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
}
00007FF6A9C51E7C xor eax,eax
00007FF6A9C51E7E add rsp,28h
00007FF6A9C51E82 ret
So using clearer code seems to be the obvious choice.
What Zang said is true but not quite to the point. I ran the code from the question with chrono to measure the time, and I think the answer needs revisiting after observing what happened.
For 1M iterations, the call through references took 3 ms, while the call via std::tie combined with std::tuple took about 94 ms.
Though the absolute difference is small in practice, the tuple version still performs slightly slower. Hence, for performance-intensive systems, I suggest passing by reference.
My code:
#include <iostream>
#include <tuple>
#include <chrono>
std::tuple<int, int> returnValues(const int a, const int b)
{
    return std::tuple<int, int>(a, b);
}

void returnValuesVoid(int &a, int &b)
{
    a += 100;
    b += 100;
}

int main()
{
    int a = 10, b = 20;

    auto begin = std::chrono::high_resolution_clock::now();
    int x, y;
    for (int i = 0; i < 1000000; i++)
    {
        std::tie(x, y) = returnValues(a, b);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()) << '\n';

    a = 10;
    b = 20;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000; i++)
    {
        returnValuesVoid(a, b);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()) << '\n';
}
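One caveat, added here as my own observation rather than part of the answer above: the timing loop goes through std::tie, whereas the question used C++17 structured bindings. A sketch along these lines is closer to the original usage and worth timing as well; note that with optimizations enabled the compiler may remove either loop entirely unless the results are actually used, which is why the values are accumulated into a sink here:

#include <chrono>
#include <iostream>
#include <tuple>

std::tuple<int, int> returnValues(const int a, const int b)
{
    return std::tuple<int, int>(a, b);
}

int main()
{
    int a = 10, b = 20;
    long long sink = 0;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000; i++)
    {
        auto [x, y] = returnValues(a, b);   // C++17 structured bindings
        sink += x + y;                      // keep the results live
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << '\n';
    std::cout << sink << '\n';
}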

Why is this version of strcmp slower?

I have been experimenting with improving the performance of strcmp under certain conditions. However, I unfortunately cannot even get a plain-vanilla implementation of strcmp to perform as well as the library implementation.
I saw a similar question, but the answers say the difference was from the compiler optimizing away the comparison on string literals. My test does not use string literals.
Here's the implementation (comparisons.cpp)
int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
        if (*a == '\0') return 0;
        a++;
        b++;
    }
    return *b - *a;
}
And here's the test driver (driver.cpp):
#include "comparisons.h"
#include <array>
#include <chrono>
#include <cstdlib>   // srand, rand
#include <cstring>   // strcmp
#include <iostream>
void init_string(char* str, int nChars) {
    // 10% of strings will be equal, and 90% of strings will have one char different.
    // This way, many strings will share long prefixes so strcmp has to exercise a bit.
    // Using random strings still shows the custom implementation as slower (just less so).
    str[nChars - 1] = '\0';
    for (int i = 0; i < nChars - 1; i++)
        str[i] = (i % 94) + 32;
    if (rand() % 10 != 0)
        str[rand() % (nChars - 1)] = 'x';
}

int main(int argc, char** argv) {
    srand(1234);

    // Pre-generate some strings to compare.
    const int kSampleSize = 100;
    std::array<char[1024], kSampleSize> strings;
    for (int i = 0; i < kSampleSize; i++)
        init_string(strings[i], kSampleSize);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp(strings[i], strings[j]);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp - " << (end - start).count() << std::endl;

    start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp_custom(strings[i], strings[j]);
    end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp_custom - " << (end - start).count() << std::endl;
}
And my makefile:
CC=clang++
test: driver.o comparisons.o
$(CC) -o test driver.o comparisons.o
# Compile the test driver with optimizations off.
driver.o: driver.cpp comparisons.h
$(CC) -c -o driver.o -std=c++11 -O0 driver.cpp
# Compile the code being tested separately with optimizations on.
comparisons.o: comparisons.cpp comparisons.h
$(CC) -c -o comparisons.o -std=c++11 -O3 comparisons.cpp
clean:
rm comparisons.o driver.o test
On the advice of this answer, I compiled my comparison function in a separate compilation unit with optimizations and compiled the driver with optimizations turned off, but I still get a slowdown of about 5x.
strcmp - 154519
strcmp_custom - 506282
I also tried copying the FreeBSD implementation but got similar results.
I'm wondering if my performance measurement is overlooking something. Or is the standard library implementation doing something fancier?
I don't know which standard library you have, but just to give you an idea of how serious C library maintainers are about optimizing the string primitives, the default strcmp used by GNU libc on x86-64 is two thousand lines of hand-optimized assembly language, as of version 2.24. There are separate, also hand-optimized, versions for when the SSSE3 and SSE4.2 instruction set extensions are available. (A fair bit of the complexity in that file appears to be because the same source code is used to generate several other functions; the machine code winds up being "only" 1120 instructions.) 2.24 was released roughly a year ago, and even more work has gone into it since.
They go to this much trouble because it's common for one of the string primitives to be the single hottest function in a profile.
Excerpts from my disassembly of glibc v2.2.5, x86_64 linux:
0000000000089cd0 <strcmp@@GLIBC_2.2.5>:
89cd0: 48 8b 15 99 a1 33 00 mov 0x33a199(%rip),%rdx # 3c3e70 <_IO_file_jumps@@GLIBC_2.2.5+0x790>
89cd7: 48 8d 05 92 58 01 00 lea 0x15892(%rip),%rax # 9f570 <strerror_l@@GLIBC_2.6+0x200>
89cde: f7 82 b0 00 00 00 10 testl $0x10,0xb0(%rdx)
89ce5: 00 00 00
89ce8: 75 1a jne 89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cea: 48 8d 05 9f 48 0c 00 lea 0xc489f(%rip),%rax # 14e590 <__nss_passwd_lookup@@GLIBC_2.2.5+0x9c30>
89cf1: f7 82 80 00 00 00 00 testl $0x200,0x80(%rdx)
89cf8: 02 00 00
89cfb: 75 07 jne 89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cfd: 48 8d 05 0c 00 00 00 lea 0xc(%rip),%rax # 89d10 <strcmp@@GLIBC_2.2.5+0x40>
89d04: c3 retq
89d05: 90 nop
89d06: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
89d0d: 00 00 00
89d10: 89 f1 mov %esi,%ecx
89d12: 89 f8 mov %edi,%eax
89d14: 48 83 e1 3f and $0x3f,%rcx
89d18: 48 83 e0 3f and $0x3f,%rax
89d1c: 83 f9 30 cmp $0x30,%ecx
89d1f: 77 3f ja 89d60 <strcmp@@GLIBC_2.2.5+0x90>
89d21: 83 f8 30 cmp $0x30,%eax
89d24: 77 3a ja 89d60 <strcmp@@GLIBC_2.2.5+0x90>
89d26: 66 0f 12 0f movlpd (%rdi),%xmm1
89d2a: 66 0f 12 16 movlpd (%rsi),%xmm2
89d2e: 66 0f 16 4f 08 movhpd 0x8(%rdi),%xmm1
89d33: 66 0f 16 56 08 movhpd 0x8(%rsi),%xmm2
89d38: 66 0f ef c0 pxor %xmm0,%xmm0
89d3c: 66 0f 74 c1 pcmpeqb %xmm1,%xmm0
89d40: 66 0f 74 ca pcmpeqb %xmm2,%xmm1
89d44: 66 0f f8 c8 psubb %xmm0,%xmm1
89d48: 66 0f d7 d1 pmovmskb %xmm1,%edx
89d4c: 81 ea ff ff 00 00 sub $0xffff,%edx
...
The real thing is 1183 lines of assembly, with lots of potential cleverness about detecting system features and vectorized instructions. libc maintainers know that they can get an edge by just optimizing some of the functions called thousands of times by applications.
For comparison, your version at -O3:
comparisons.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z13strcmp_customPKcS0_>:
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
0: 8a 0e mov (%rsi),%cl
2: 8a 07 mov (%rdi),%al
4: 38 c1 cmp %al,%cl
6: 75 1e jne 26 <_Z13strcmp_customPKcS0_+0x26>
if (*a == '\0') return 0;
8: 48 ff c6 inc %rsi
b: 48 ff c7 inc %rdi
e: 66 90 xchg %ax,%ax
10: 31 c0 xor %eax,%eax
12: 84 c9 test %cl,%cl
14: 74 18 je 2e <_Z13strcmp_customPKcS0_+0x2e>
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
16: 0f b6 0e movzbl (%rsi),%ecx
19: 0f b6 07 movzbl (%rdi),%eax
1c: 48 ff c6 inc %rsi
1f: 48 ff c7 inc %rdi
22: 38 c1 cmp %al,%cl
24: 74 ea je 10 <_Z13strcmp_customPKcS0_+0x10>
26: 0f be d0 movsbl %al,%edx
29: 0f be c1 movsbl %cl,%eax
if (*a == '\0') return 0;
a++;
b++;
}
return *b - *a;
2c: 29 d0 sub %edx,%eax
}
2e: c3 retq
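To give a concrete feel for the vectorized idea visible in the glibc excerpt (pcmpeqb/pmovmskb), here is a minimal SSE2 sketch of my own. It is emphatically not the glibc algorithm: it over-reads up to 15 bytes past the terminator and ignores the alignment and page-boundary handling the real implementation does, so treat it purely as an illustration:

#include <emmintrin.h>   // SSE2 intrinsics

// Sketch only: assumes it is safe to read 16 bytes at a time from both strings.
int strcmp_sse2_sketch(const char* a, const char* b) {
    for (;;) {
        __m128i va  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
        __m128i eq  = _mm_cmpeq_epi8(va, vb);                   // 0xFF where the bytes match
        __m128i nul = _mm_cmpeq_epi8(va, _mm_setzero_si128());  // 0xFF where a has '\0'
        // Bit i is set if byte i differs or terminates the string.
        unsigned mask = (~_mm_movemask_epi8(eq) | _mm_movemask_epi8(nul)) & 0xFFFF;
        if (mask) {
            int i = __builtin_ctz(mask);   // first differing/terminating byte (GCC/Clang builtin)
            return (unsigned char)a[i] - (unsigned char)b[i];
        }
        a += 16;
        b += 16;
    }
}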

Unsigned int to unsigned long long well defined?

I wanted to see what happens behind the scenes when an unsigned long long is assigned the value of an unsigned int. I made a simple C++ program to try it out and moved all the I/O out of main():
#include <iostream>
#include <stdlib.h>

void usage() {
    std::cout << "Usage: ./u_to_ull <unsigned int>\n";
    exit(0);
}

void atoiWarning(int foo) {
    std::cout << "WARNING: atoi() returned " << foo << " and (unsigned int)foo is " <<
        ((unsigned int)foo) << "\n";
}

void result(unsigned long long baz) {
    std::cout << "Result as unsigned long long is " << baz << "\n";
}

int main(int argc, char** argv) {
    if (argc != 2) usage();
    int foo = atoi(argv[1]);
    if (foo < 0) atoiWarning(foo);

    // Signed to unsigned
    unsigned int bar = foo;

    // Conversion
    unsigned long long baz = -1;
    baz = bar;

    result(baz);
    return 0;
}
The resulting assembly produced this for main:
0000000000400950 <main>:
400950: 55 push %rbp
400951: 48 89 e5 mov %rsp,%rbp
400954: 48 83 ec 20 sub $0x20,%rsp
400958: 89 7d ec mov %edi,-0x14(%rbp)
40095b: 48 89 75 e0 mov %rsi,-0x20(%rbp)
40095f: 83 7d ec 02 cmpl $0x2,-0x14(%rbp)
400963: 74 05 je 40096a <main+0x1a>
400965: e8 3a ff ff ff callq 4008a4 <_Z5usagev>
40096a: 48 8b 45 e0 mov -0x20(%rbp),%rax
40096e: 48 83 c0 08 add $0x8,%rax
400972: 48 8b 00 mov (%rax),%rax
400975: 48 89 c7 mov %rax,%rdi
400978: e8 0b fe ff ff callq 400788 <atoi@plt>
40097d: 89 45 f0 mov %eax,-0x10(%rbp)
400980: 83 7d f0 00 cmpl $0x0,-0x10(%rbp)
400984: 79 0a jns 400990 <main+0x40>
400986: 8b 45 f0 mov -0x10(%rbp),%eax
400989: 89 c7 mov %eax,%edi
40098b: e8 31 ff ff ff callq 4008c1 <_Z11atoiWarningi>
400990: 8b 45 f0 mov -0x10(%rbp),%eax
400993: 89 45 f4 mov %eax,-0xc(%rbp)
400996: 48 c7 45 f8 ff ff ff movq $0xffffffffffffffff,-0x8(%rbp)
40099d: ff
40099e: 8b 45 f4 mov -0xc(%rbp),%eax
4009a1: 48 89 45 f8 mov %rax,-0x8(%rbp)
4009a5: 48 8b 45 f8 mov -0x8(%rbp),%rax
4009a9: 48 89 c7 mov %rax,%rdi
4009ac: e8 66 ff ff ff callq 400917 <_Z6resulty>
4009b1: b8 00 00 00 00 mov $0x0,%eax
4009b6: c9 leaveq
4009b7: c3 retq
The -1 in the C++ code makes it clear that -0x8(%rbp) corresponds to baz (due to the $0xffffffffffffffff). -0x8(%rbp) is written from %rax, but the top four bytes of %rax appear not to have been assigned; only %eax was assigned.
Does this suggest that the top 4 bytes of -0x8(%rbp) are undefined?
In the Intel® 64 and IA-32 Architectures Software Developer's Manuals, volume 1, chapter 3.4.1.1 (General-Purpose Registers in 64-Bit Mode), it says
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register.
So after mov -0xc(%rbp),%eax, the upper half of rax is defined, and it's zero.
This also applies to the 87 C0 encoding of xchg eax, eax, but not to its 90 encoding (which is defined as nop, overruling the rule quoted above).
From C++98 (and C++11 seems to be unchanged) 4.7/2 (integral conversions - no promotions are relevant) we learn:
If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type).
This clearly shows that as long as the source and destination are unsigned and the destination is at least as large as the source, the value will be unchanged. If the compiler generated code that failed to make the larger value equal, the compiler is buggy.
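A trivial check of that rule, added here for illustration (not from the original answer):

#include <iostream>

int main() {
    unsigned int bar = 0xDEADBEEFu;        // any unsigned int value
    unsigned long long baz = bar;          // widening unsigned conversion
    static_assert(sizeof(unsigned long long) >= sizeof(unsigned int),
                  "destination is at least as wide as the source");
    std::cout << (baz == bar) << '\n';     // prints 1: the value is unchanged
}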

Local variable vs. array access

Which of these would be more computationally efficient, and why?
A) Repeated array access:
for (i = 0; i < numbers.length; i++) {
    result[i] = numbers[i] * numbers[i] * numbers[i];
}
B) Setting a local variable:
for (i = 0; i < numbers.length; i++) {
    int n = numbers[i];
    result[i] = n * n * n;
}
Wouldn't the repeated array accesses have to be recalculated each time (using pointer arithmetic), making the first option slower because it is effectively doing this?
for (i = 0; i < numbers.length; i++) {
    result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
}
Any sufficiently sophisticated compiler will generate the same code for all three solutions. I turned your three versions into a small C program (with a minor adjustment: I changed the access to numbers.length into a macro invocation that gives the length of an array):
#include <stddef.h>

size_t i;
static const int numbers[] = { 0, 1, 2, 4, 5, 6, 7, 8, 9 };

#define ARRAYLEN(x) (sizeof((x)) / sizeof(*(x)))
static int result[ARRAYLEN(numbers)];

void versionA(void)
{
    for (i = 0; i < ARRAYLEN(numbers); i++) {
        result[i] = numbers[i] * numbers[i] * numbers[i];
    }
}

void versionB(void)
{
    for (i = 0; i < ARRAYLEN(numbers); i++) {
        int n = numbers[i];
        result[i] = n * n * n;
    }
}

void versionC(void)
{
    for (i = 0; i < ARRAYLEN(numbers); i++) {
        result[i] = *(numbers + i) * *(numbers + i) * *(numbers + i);
    }
}
I then compiled it using optimizations (and debug symbols, for prettier disassembly) with Visual Studio 2012:
C:\Temp>cl /Zi /O2 /Wall /c so19244189.c
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50727.1 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
so19244189.c
Finally, here's the disassembly:
C:\Temp>dumpbin /disasm so19244189.obj
[..]
_versionA:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionB:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
_versionC:
00000000: 33 C0 xor eax,eax
00000002: 8B 0C 85 00 00 00 mov ecx,dword ptr _numbers[eax*4]
00
00000009: 8B D1 mov edx,ecx
0000000B: 0F AF D1 imul edx,ecx
0000000E: 0F AF D1 imul edx,ecx
00000011: 89 14 85 00 00 00 mov dword ptr _result[eax*4],edx
00
00000018: 40 inc eax
00000019: 83 F8 09 cmp eax,9
0000001C: 72 E4 jb 00000002
0000001E: A3 00 00 00 00 mov dword ptr [_i],eax
00000023: C3 ret
Note how the assembly is exactly the same in all cases. So the correct answer to your question
Which of these would be more computationally efficient, and why?
for this compiler is: mu. Your question cannot be answered because it is based on incorrect assumptions. None of the versions is faster than any other.
The theoretical answer:
A reasonably good optimizing compiler should convert version A to version B, and perform only one load from memory. There should be no performance difference if optimization is enabled.
If optimization is disabled, version A will be slower, because the address must be computed 3 times and there are 3 memory loads (2 of them are cached and very fast, but it's still slower than reusing a register).
In practice, the answer will depend on your compiler, and you should check this by benchmarking.
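For example, a minimal timing harness along these lines would do; this is my own sketch, assuming the versionA/versionB definitions above are compiled (as C) into a separate object file and linked in. With optimizations enabled, the two times should come out essentially identical:

#include <chrono>
#include <iostream>

// Declarations for the functions defined in the C file above.
extern "C" void versionA(void);
extern "C" void versionB(void);

int main()
{
    using clock = std::chrono::high_resolution_clock;
    const long reps = 10000000L;

    auto t0 = clock::now();
    for (long k = 0; k < reps; k++) versionA();
    auto t1 = clock::now();
    for (long k = 0; k < reps; k++) versionB();
    auto t2 = clock::now();

    std::cout << "versionA: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms\n"
              << "versionB: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
}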
It depends on the compiler, but all of them should end up the same.
First let's look at case B: a smart compiler will generate code that loads the value into a register only once, so it doesn't matter whether you introduce the extra variable or not; the compiler emits a mov and keeps the value in a register. So B is the same as A.
Now let's compare A and C. We should look at how operator[] is defined: a[b] is actually *(a + b), so *(numbers + i) is the same as numbers[i], which means cases A and C are the same.
So we have (A==B) && (A==C), and all in all (A==B==C), if you know what I mean :).

Why is the inlined function slower than the function pointer?

Consider the following code:
typedef void (*Fn)();

volatile long sum = 0;

inline void accu() {
    sum += 4;
}

static const Fn map[4] = {&accu, &accu, &accu, &accu};

int main(int argc, char** argv) {
    static const long N = 10000000L;
    if (argc == 1)
    {
        for (long i = 0; i < N; i++)
        {
            accu();
            accu();
            accu();
            accu();
        }
    }
    else
    {
        for (long i = 0; i < N; i++)
        {
            for (int j = 0; j < 4; j++)
                (*map[j])();
        }
    }
}
When I compiled it with:
g++ -O3 test.cpp
I expected the first branch to run faster because the compiler could inline the calls to accu, whereas the second branch cannot be inlined because accu is called through function pointers stored in an array.
But the results surprised me:
time ./a.out
real 0m0.108s
user 0m0.104s
sys 0m0.000s
time ./a.out 1
real 0m0.095s
user 0m0.088s
sys 0m0.004s
I don't understand why, so I did an objdump:
objdump -DStTrR a.out > a.s
and the disassembly doesn't seem to explain the performance result I got:
8048300 <main>:
8048300: 55 push %ebp
8048301: 89 e5 mov %esp,%ebp
8048303: 53 push %ebx
8048304: bb 80 96 98 00 mov $0x989680,%ebx
8048309: 83 e4 f0 and $0xfffffff0,%esp
804830c: 83 7d 08 01 cmpl $0x1,0x8(%ebp)
8048310: 74 27 je 8048339 <main+0x39>
8048312: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
8048318: e8 23 01 00 00 call 8048440 <_Z4accuv>
804831d: e8 1e 01 00 00 call 8048440 <_Z4accuv>
8048322: e8 19 01 00 00 call 8048440 <_Z4accuv>
8048327: e8 14 01 00 00 call 8048440 <_Z4accuv>
804832c: 83 eb 01 sub $0x1,%ebx
804832f: 90 nop
8048330: 75 e6 jne 8048318 <main+0x18>
8048332: 31 c0 xor %eax,%eax
8048334: 8b 5d fc mov -0x4(%ebp),%ebx
8048337: c9 leave
8048338: c3 ret
8048339: b8 80 96 98 00 mov $0x989680,%eax
804833e: 66 90 xchg %ax,%ax
8048340: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048346: 83 c2 04 add $0x4,%edx
8048349: 89 15 18 a0 04 08 mov %edx,0x804a018
804834f: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048355: 83 c2 04 add $0x4,%edx
8048358: 89 15 18 a0 04 08 mov %edx,0x804a018
804835e: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048364: 83 c2 04 add $0x4,%edx
8048367: 89 15 18 a0 04 08 mov %edx,0x804a018
804836d: 8b 15 18 a0 04 08 mov 0x804a018,%edx
8048373: 83 c2 04 add $0x4,%edx
8048376: 83 e8 01 sub $0x1,%eax
8048379: 89 15 18 a0 04 08 mov %edx,0x804a018
804837f: 75 bf jne 8048340 <main+0x40>
8048381: eb af jmp 8048332 <main+0x32>
8048383: 90 nop
...
8048440 <_Z4accuv>:
8048440: a1 18 a0 04 08 mov 0x804a018,%eax
8048445: 83 c0 04 add $0x4,%eax
8048448: a3 18 a0 04 08 mov %eax,0x804a018
804844d: c3 ret
804844e: 90 nop
804844f: 90 nop
It seems the direct-call branch is definitely doing less work than the function-pointer branch.
But why does the function-pointer branch run faster than the direct call?
Note that "time" is not the only thing I used for measurement; I also used clock_gettime and got similar results.
It is not completely true that the second branch cannot be inlined. In fact, all the function pointers stored in the array are known at compile time, so the compiler can substitute the indirect calls with direct calls (and it does so). In theory it could go further and inline them, in which case the two branches would be identical, but this particular compiler is not smart enough to do that.
As a result, the first branch is optimized "better", with one exception: the compiler is not allowed to optimize away accesses to the volatile variable sum. As you can see in the disassembled code, this produces store instructions immediately followed by load instructions that depend on them:
mov %edx,0x804a018
mov 0x804a018,%edx
Intel's Software Optimization Manual (section 3.6.5.2) does not recommend arranging instructions like this:
... if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.
The second branch avoids this problem because the additional call/return instructions sit between the store and the load, so it performs better.
A similar improvement can be made to the first branch if we insert some (inexpensive) calculations between the calls:
long x1 = 0;
for (long i = 0; i < N; i++)
{
    x1 ^= i << 8;
    accu();
    x1 ^= i << 1;
    accu();
    x1 ^= i << 2;
    accu();
    x1 ^= i << 4;
    accu();
}
sum += x1;