C++ Return Performance

C++ Return Performance - c++

I have a question about performance. I think this can also applies to other languages (not only C++).
Imagine that I have this function:
int addNumber(int a, int b){
int result = a + b;
return result;
}
Is there any performance improvement if I write the code above like this?
int addNumber(int a, int b){
return a + b;
}
I have this question because the second function doesn´t declare a 3rd variable. But would the compiler detect this in the first code?

To answer this question you can look at the generated assembler code. With -O2, x86-64 gcc 6.2 generates exactly the same code for both methods:
addNumber(int, int):
lea eax, [rdi+rsi]
ret
addNumber2(int, int):
lea eax, [rdi+rsi]
ret
Only without optimization turned on, there is a difference:
addNumber(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov edx, DWORD PTR [rbp-20]
mov eax, DWORD PTR [rbp-24]
add eax, edx
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
addNumber2(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
However, performance comparison without optimization is meaningless

In principle there is no difference between the two approaches. The majority of compilers have handled this type of optimisation for some decades.
Additionally, if the function can be inlined (e.g. its definition is visible to the compiler when compiling code that uses such a function) the majority of compilers will eliminate the function altogether, and simply emit code to add the two variables passed and store the result as required by the caller.
Obviously, the comments above assume compiling with a relevant optimisation setting (e.g. not doing a debug build without optimisation).
Personally, I would not write such a function anyway. It is easier, in the caller, to write c = a + b instead of c = addNumber(a, b), so having a function like that offers no benefit to either programmer (effort to understand) or program (performance, etc). You might as well write comments that give no useful information.
c = a + b; // add a and b and store into c
Any self-respecting code reviewer would complain bitterly about uninformative functions or uninformative comments.
I'd only use such a function if its name conveyed some special meaning (i.e. more than just adding two values) for the application
c = FunkyOperation(a,b);
int FunkyOperation(int a, int b)
{
/* Many useful ways of implementing this operation.
One of those ways happens to be addition, but we need to
go through 25 pages of obscure mathematical proof to
realise that
*/
return a + b;
}

Related

calls to constexpr vs inline functions compile to different assembly with optimization disabled

I came across this awesome online Compiler Explorer https://godbolt.org/
which shows assembly version of your code.
I was also reading about new C++ 11 features and found out about constexpr.
take a look at square function below :
constexpr int square(int num) {
return num * num;
}
int main()
{
int result = square(2);
return 0;
}
and following assembly code generated for two versions (constexpr and inline)
CONSTEXPR https://godbolt.org/z/c69qrevET
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 4 ; compile time constant 4 = 2*2
mov eax, 0
pop rbp
ret
INLINE https://godbolt.org/z/czaKT8fhY
square(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov eax, DWORD PTR [rbp-4]
imul eax, DWORD PTR [rbp-4]
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 2
call square(int)
mov DWORD PTR [rbp-4], eax
mov eax, 0
leave
ret
I read everywhere that functions like this can be inlined but why there's a function call code in asm version? According to inline definition it should be avoided right?

constexpr functions are not guaranteed to be executed at compile-time unless they're used in a context where a constant expression is required. Change your code to
int main()
{
constexpr int result = square(2);
return 0;
}
and you'll see a difference, because constexpr variables require to be initialized with a constant expression.
Note that optimization level also matters.

Should I create object on a free store only for one call?

Assuming that code is located inside if block, what are differences between creating object in a free store and doing only one call on it:
auto a = aFactory.createA();
int result = a->foo(5);
and making call directly on returned pointer?
int result = aFactory.createA()->foo(5);
Is there any difference in performance? Which way is better?
#include <iostream>
#include <memory>
class A
{
public:
int foo(int a){return a+3;}
};
class AFactory
{
public:
std::unique_ptr<A> createA(){return std::make_unique<A>();}
};
int main()
{
AFactory aFactory;
bool condition = true;
if(condition)
{
auto a = aFactory.createA();
int result = a->foo(5);
}
}

Look here. There is no difference in code generated for both versions even with optimisations disabled.
Using gcc7.1 with -std=c++1z -O0
auto a = aFactory.createA();
int result = a->foo(5);
is compiled to:
lea rax, [rbp-24]
lea rdx, [rbp-9]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
and int result = aFactory.createA()->foo(5); to:
lea rax, [rbp-16]
lea rdx, [rbp-17]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
So they are pretty much identical.
This outcome is understandable when you realise the only difference between the two versions is that in the first one we assign the name to our object, while in the second we work with an unnamed one. Other then that, they are both created on heap and used the same way. And since variable name means nothing to the compiler - it is only relevant for code-readers - it treats the two as if they were identical.

In your simple case it will not make a difference, because the (main) function ends right after creating and using a.
If some more lines of code would follow, the destruction of the a object would happen at the end of the if block in main, while in the one line case it becomes destructed at the end of that single line. However, it would be bad design if the destructor of a more sophisticated class A would make a difference on that.
Due to compiler optimizations performance questions should always be answered by testing with a profiler on the concrete code.

How does reference translates to asm in gcc?

Is reference is compiled as usual pointer or it has other stuff behind?
And how does it differ in clang?

You can think of a reference as an immutable pointer that is automatically de-referenced on usage. This isn't what the C++ standard says, so you cannot rely on that being an actual implementation.
Practically speaking though, it likely to be what you see in many cases.
Take the following example in the case of parameter passing:
#include <stdio.h>
void function (int *const n){
printf("%d",*n);
}
void function (int & n){
printf("%d",n);
}
int main(){
int n = 123;
function(&n);
function(n);
}
Both gcc and clang produce identical code for the functions without any optimizations enabled:
function(int*):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
function(int&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret

How does reference translates to asm in gcc?
In general: It depends.
To find that out in a specific case, you can test by reading the generated assembly.
Is reference is compiled as usual pointer or it has other stuff behind?
The implementation of code using a reference is practically identical to one using a pointer to achieve identical indirection. How each are implemented, is not guaranteed by the standard, but there is no need for implementing them differently.
References only differ from pointers in the way that they are allowed to use by the rules of C++. Of course, because the rules are different, pointers can be used in a way that references can not. And in such case you cannot compare whether pointers generate the same assembly.
The limitations of a reference might make some optimizations easier, so there might be a difference, but such optimization would also have been possible with pointers, so there is no guarantee of different assembly output when using references instead of pointers.
And how does it differ in clang?
In general: It depends.
Both compilers are bound by the same rules of the standard. They might generate identical assembly or different. How the generated assembly of particular version of one compiler differs (if at all) from the assembly generated by a particular version of another compiler, with particular compilation options for each, on particular processor architecture on particular operating system, in a particular use case of a reference, can be found by inspecting and comparing the generated assembly in each particular case.

unique_ptr vs class instance as member variable

There is a class SomeClass which holds some data and methods that operates on this data. And it must be created with some arguments like:
SomeClass(int some_val, float another_val);
There is another class, say Manager, which includes SomeClass, and heavily uses its methods.
So, what would be better in terms of performance (data locality, cache hits, etc.), declare object of SomeClass as member of Manager and use member initialization in Manager's constructor or declare object of SomeClass as unique_ptr?
class Manager
{
public:
Manager() : some(5, 3.0f) {}
private:
SomeClass some;
};
or
class Manager
{
public:
Manager();
private:
std::unique_ptr<SomeClass> some;
}

Short answer
Most likely, there is no difference in runtime efficiency of accessing your subobject. But using pointer can be slower for several reasons (see details below).
Moreover, there are several other things you should remember:
When using pointer, you usually have to allocate/deallocate memory for subobject separately, which takes some time (quite a lot if you do it much).
When using pointer, you can cheaply move your subobject without copying.
Speaking of compile times, pointer is better than plain member. With plain member, you cannot remove dependency of Manager declaration on SomeClass declaration. With pointers, you can do it with forward declaration. Less dependencies may result is less build times.
Details
I'd like to provide more details about performance of subobject accesses. I think that using pointer can be slower than using plain member for several reasons:
Data locality (and cache performance) is likely to be better with plain member. You usually access data of Manager and SomeClass together, and plain member is guaranteed to be near other data, while heap allocations may place object and subobject far from each other.
Using pointer means one more level of indirection. To get address of a plain member, you can simply add a compile-time constant offset fo object address (which is often merged with other assembly instruction). When using pointer, you have to additionally read a word from the member pointer to get actual pointer to subobject. See Q1 and Q2 for more details.
Aliasing is perhaps the most important issue. If you are using plain member, then compiler can assume that: your subobject lies fully within your object in memory, and it does not overlap with other members of your object. When using pointer, compiler often cannot assume anything like this: you subobject may overlap with your object and its members. As a result, compiler has to generate more useless load/store operations, because it thinks that some values may change.
Here is an example for the last issue (full code is here):
struct IntValue {
int x;
IntValue(int x) : x(x) {}
};
class MyClass_Ptr {
unique_ptr<IntValue> a, b, c;
public:
void Compute() {
a->x += b->x + c->x;
b->x += a->x + c->x;
c->x += a->x + b->x;
}
};
Clearly, it is stupid to store subobjects a, b, c by pointers. I've measured time spent in one billion calls of Compute method for a single object. Here are results with different configurations:
2.3 sec: plain member (MinGW 5.1.0)
2.0 sec: plain member (MSVC 2013)
4.3 sec: unique_ptr (MinGW 5.1.0)
9.3 sec: unique_ptr (MSVC 2013)
When looking at the generated assembly for innermost loop in each case, it is easy to understand why the times are so different:
;;; plain member (GCC)
lea edx, [rcx+rax] ; well-optimized code: only additions on registers
add r8d, edx ; all 6 additions present (no CSE optimization)
lea edx, [r8+rax] ; ('lea' instruction is also addition BTW)
add ecx, edx
lea edx, [r8+rcx]
add eax, edx
sub r9d, 1
jne .L3
;;; plain member (MSVC)
add ecx, r8d ; well-optimized code: only additions on registers
add edx, ecx ; 5 additions instead of 6 due to a common subexpression eliminated
add ecx, edx
add r8d, edx
add r8d, ecx
dec r9
jne SHORT $LL6#main
;;; unique_ptr (GCC)
add eax, DWORD PTR [rcx] ; slow code: a lot of memory accesses
add eax, DWORD PTR [rdx] ; each addition loads value from memory
mov DWORD PTR [rdx], eax ; each sum is stored to memory
add eax, DWORD PTR [r8] ; compiler is afraid that some values may be at same address
add eax, DWORD PTR [rcx]
mov DWORD PTR [rcx], eax
add eax, DWORD PTR [rdx]
add eax, DWORD PTR [r8]
sub r9d, 1
mov DWORD PTR [r8], eax
jne .L4
;;; unique_ptr (MSVC)
mov r9, QWORD PTR [rbx] ; awful code: 15 loads, 3 stores
mov rcx, QWORD PTR [rbx+8] ; compiler thinks that values may share
mov rdx, QWORD PTR [rbx+16] ; same address with pointers to values!
mov r8d, DWORD PTR [rcx]
add r8d, DWORD PTR [rdx]
add DWORD PTR [r9], r8d
mov r8, QWORD PTR [rbx+8]
mov rcx, QWORD PTR [rbx] ; load value of 'a' pointer from memory
mov rax, QWORD PTR [rbx+16]
mov edx, DWORD PTR [rcx] ; load value of 'a->x' from memory
add edx, DWORD PTR [rax] ; add the 'c->x' value
add DWORD PTR [r8], edx ; add sum 'a->x + c->x' to 'b->x'
mov r9, QWORD PTR [rbx+16]
mov rax, QWORD PTR [rbx] ; load value of 'a' pointer again =)
mov rdx, QWORD PTR [rbx+8]
mov r8d, DWORD PTR [rax]
add r8d, DWORD PTR [rdx]
add DWORD PTR [r9], r8d
dec rsi
jne SHORT $LL3#main

How does returning values from a function work?

I recently had a serious bug, where I forgot to return a value in a function. The problem was that even though nothing was returned it worked fine under Linux/Windows and only crashed under Mac. I discovered the bug when I turned on all compiler warnings.
So here is a simple example:
#include <iostream>
class A{
public:
A(int p1, int p2, int p3): v1(p1), v2(p2), v3(p3)
{
}
int v1;
int v2;
int v3;
};
A* getA(){
A* p = new A(1,2,3);
// return p;
}
int main(){
A* a = getA();
std::cerr << "A: v1=" << a->v1 << " v2=" << a->v2 << " v3=" << a->v3 << std::endl;
return 0;
}
My question is how can this work under Linux/Windows without crashing? How is the returning of values done on lower level?

On Intel architecture, simple values (integers and pointers) are usually returned in eax register. This register (among others) is also used as temporary storage when moving values in memory and as operand during calculations. So whatever value left in that register is treated as the return value, and in your case it turned out to be exactly what you wanted to be returned.

Probably by luck, 'a' left in a register that happens to be used for returning single pointer results, something like that.
The calling/ conventions and function result returns are architecture-dependent, so it's not surprising that your code works on Windows/Linux but not on a Mac.

There are two major ways for a compiler to return a value:
Put a value in a register before returning, and
Have the caller pass a block of stack memory for the return value, and write the value into that block [more info]
The #1 is usually used with anything that fits into a register; #2 is for everything else (large structs, arrays, et cetera).
In your case, the compiler uses #1 both for the return of new and for the return of your function. On Linux and Windows, the compiler did not perform any value-distorting operations on the register with the returned value between writing it into the pointer variable and returning from your function; on Mac, it did. Hence the difference in the results that you see: in the first case, the left-over value in the return register happened to co-inside with the value that you wanted to return anyway.

First off, you need to slightly modify your example to get it to compile. The function must have at least an execution path that returns a value.
A* getA(){
if(false)
return NULL;
A* p = new A(1,2,3);
// return p;
}
Second, it's obviously undefined behavior, which means anything can happen, but I guess this answer won't satisfy you.
Third, in Windows it works in Debug mode, but if you compile under Release, it doesn't.
The following is compiled under Debug:
A* p = new A(1,2,3);
00021535 push 0Ch
00021537 call operator new (211FEh)
0002153C add esp,4
0002153F mov dword ptr [ebp-0E0h],eax
00021545 mov dword ptr [ebp-4],0
0002154C cmp dword ptr [ebp-0E0h],0
00021553 je getA+7Eh (2156Eh)
00021555 push 3
00021557 push 2
00021559 push 1
0002155B mov ecx,dword ptr [ebp-0E0h]
00021561 call A::A (21271h)
00021566 mov dword ptr [ebp-0F4h],eax
0002156C jmp getA+88h (21578h)
0002156E mov dword ptr [ebp-0F4h],0
00021578 mov eax,dword ptr [ebp-0F4h]
0002157E mov dword ptr [ebp-0ECh],eax
00021584 mov dword ptr [ebp-4],0FFFFFFFFh
0002158B mov ecx,dword ptr [ebp-0ECh]
00021591 mov dword ptr [ebp-14h],ecx
The second instruction, the call to operator new, moves into eax the pointer to the newly created instance.
A* a = getA();
0010484E call getA (1012ADh)
00104853 mov dword ptr [a],eax
The calling context expects eax to contain the returned value, but it does not, it contains the last pointer allocated by new, which is incidentally, p.
So that's why it works.

As Kerrek SB mentioned, your code has ventured into the realm of undefined behavior.
Basically, your code is going to compile down to assembly. In assembly, there's no concept of a function requiring a return type, there's just an expectation. I'm the most comfortable with MIPS, so I shall use MIPS to illustrate.
Assume you have the following code:
int add(x, y)
{
return x + y;
}
This is going to be translated to something like:
add:
add $v0, $a0, $a1 #add $a0 and $a1 and store it in $v0
jr $ra #jump back to where ever this code was jumped to from
To add 5 and 4, the code would be called something like:
addi $a0, $0, 5 # 5 is the first param
addi $a1, $0, 4 # 4 is the second param
jal add
# $v0 now contains 9
Note that unlike C, there's no explicit requirement that $v0 contain the return value, just an expectation. So, what happens if you don't actually push anything into $v0? Well, $v0 always has some value, so the value will be whatever it last was.
Note: This post makes some simplifications. Also, you're computer is likely not running MIPS... But hopefully the example holds, and if you learned assembly at a university, MIPS might be what you know anyway.

The way of returning of value from the function depends on architecture and the type of value. It could be done thru registers or thru stack.
Typically in the x86 architecture the value is returned in EAX register if it is an integral type: char, int or pointer.
When you don't specify the return value, that value is undefined. This is only your luck that your code sometimes worked correctly.

When popping values from the stack in IBM PC architecture there is no physical destruction of the old values of data stored there. They just become unavailable through the operation of the stack, but still remain in the same memory cell.
Of course, the previous values of these data will be destroyed during the subsequent pushing of new data on the stack.
So probably you are just lucky enough, and nothing is added to stack during your function's call and return surrounding code.

Regarding the following statement from n3242 draft C++ Standard, paragraph 6.6.3.2, your example yields undefined behavior:
Flowing off the end of a function is equivalent to a return with no
value; this results in undefined behavior in a value-returning
function.
The best way to see what actually happens is to check the assembly code generated by the given compiler on a given architecture. For the following code:
#pragma warning(default:4716)
int foo(int a, int b)
{
int c = a + b;
}
int main()
{
int n = foo(1, 2);
}
...VS2010 compiler (in Debug mode, on Intel 32-bit machine) generates the following assembly:
#pragma warning(default:4716)
int foo(int a, int b)
{
011C1490 push ebp
011C1491 mov ebp,esp
011C1493 sub esp,0CCh
011C1499 push ebx
011C149A push esi
011C149B push edi
011C149C lea edi,[ebp-0CCh]
011C14A2 mov ecx,33h
011C14A7 mov eax,0CCCCCCCCh
011C14AC rep stos dword ptr es:[edi]
int c = a + b;
011C14AE mov eax,dword ptr [a]
011C14B1 add eax,dword ptr [b]
011C14B4 mov dword ptr [c],eax
}
...
int main()
{
011C14D0 push ebp
011C14D1 mov ebp,esp
011C14D3 sub esp,0CCh
011C14D9 push ebx
011C14DA push esi
011C14DB push edi
011C14DC lea edi,[ebp-0CCh]
011C14E2 mov ecx,33h
011C14E7 mov eax,0CCCCCCCCh
011C14EC rep stos dword ptr es:[edi]
int n = foo(1, 2);
011C14EE push 2
011C14F0 push 1
011C14F2 call foo (11C1122h)
011C14F7 add esp,8
011C14FA mov dword ptr [n],eax
}
The result of addition operation in foo() is stored in eax register (accumulator) and its content is used as a return value of the function, moved to variable n.
eax is used to store a return value (pointer) in the following example as well:
#pragma warning(default:4716)
int* foo(int a)
{
int* p = new int(a);
}
int main()
{
int* pn = foo(1);
if(pn)
{
int n = *pn;
delete pn;
}
}
Assembly code:
#pragma warning(default:4716)
int* foo(int a)
{
000C1520 push ebp
000C1521 mov ebp,esp
000C1523 sub esp,0DCh
000C1529 push ebx
000C152A push esi
000C152B push edi
000C152C lea edi,[ebp-0DCh]
000C1532 mov ecx,37h
000C1537 mov eax,0CCCCCCCCh
000C153C rep stos dword ptr es:[edi]
int* p = new int(a);
000C153E push 4
000C1540 call operator new (0C1253h)
000C1545 add esp,4
000C1548 mov dword ptr [ebp-0D4h],eax
000C154E cmp dword ptr [ebp-0D4h],0
000C1555 je foo+50h (0C1570h)
000C1557 mov eax,dword ptr [ebp-0D4h]
000C155D mov ecx,dword ptr [a]
000C1560 mov dword ptr [eax],ecx
000C1562 mov edx,dword ptr [ebp-0D4h]
000C1568 mov dword ptr [ebp-0DCh],edx
000C156E jmp foo+5Ah (0C157Ah)
std::operator<<<std::char_traits<char> >:
000C1570 mov dword ptr [ebp-0DCh],0
000C157A mov eax,dword ptr [ebp-0DCh]
000C1580 mov dword ptr [p],eax
}
...
int main()
{
000C1610 push ebp
000C1611 mov ebp,esp
000C1613 sub esp,0E4h
000C1619 push ebx
000C161A push esi
000C161B push edi
000C161C lea edi,[ebp-0E4h]
000C1622 mov ecx,39h
000C1627 mov eax,0CCCCCCCCh
000C162C rep stos dword ptr es:[edi]
int* pn = foo(1);
000C162E push 1
000C1630 call foo (0C124Eh)
000C1635 add esp,4
000C1638 mov dword ptr [pn],eax
if(pn)
000C163B cmp dword ptr [pn],0
000C163F je main+51h (0C1661h)
{
int n = *pn;
000C1641 mov eax,dword ptr [pn]
000C1644 mov ecx,dword ptr [eax]
000C1646 mov dword ptr [n],ecx
delete pn;
000C1649 mov eax,dword ptr [pn]
000C164C mov dword ptr [ebp-0E0h],eax
000C1652 mov ecx,dword ptr [ebp-0E0h]
000C1658 push ecx
000C1659 call operator delete (0C1249h)
000C165E add esp,4
}
}
VS2010 compiler issues warning 4716 in both examples. By default this warning is promoted to an error.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js