There is a class SomeClass which holds some data and methods that operate on this data, and it must be created with some arguments like:
SomeClass(int some_val, float another_val);
There is another class, say Manager, which includes SomeClass, and heavily uses its methods.
So, which would be better in terms of performance (data locality, cache hits, etc.): declaring the SomeClass object as a plain member of Manager and using member initialization in Manager's constructor, or holding it through a unique_ptr?
class Manager
{
public:
Manager() : some(5, 3.0f) {}
private:
SomeClass some;
};
or
class Manager
{
public:
Manager();
private:
std::unique_ptr<SomeClass> some;
};
Short answer
Most likely, there is no difference in the runtime efficiency of accessing your subobject. However, using a pointer can be slower for several reasons (see details below).
Moreover, there are several other things you should remember:
When using a pointer, you usually have to allocate/deallocate memory for the subobject separately, which takes some time (quite a lot if you do it often).
When using a pointer, you can cheaply move your subobject without copying it.
Speaking of compile times, a pointer is better than a plain member. With a plain member, you cannot remove the dependency of the Manager declaration on the SomeClass declaration; with a pointer, you can, using a forward declaration. Fewer dependencies may result in shorter build times.
Details
I'd like to provide more details about the performance of subobject accesses. I think that using a pointer can be slower than using a plain member for several reasons:
Data locality (and cache performance) is likely to be better with a plain member. You usually access the data of Manager and SomeClass together, and a plain member is guaranteed to be near the other members, while heap allocation may place the object and its subobject far from each other.
Using a pointer means one more level of indirection. To get the address of a plain member, you simply add a compile-time-constant offset to the object's address (and this addition is often merged into another assembly instruction). When using a pointer, you additionally have to read a word from the member pointer to get the actual address of the subobject. See Q1 and Q2 for more details.
Aliasing is perhaps the most important issue. If you are using a plain member, the compiler can assume that your subobject lies fully within your object in memory and does not overlap with the object's other members. When using a pointer, the compiler often cannot assume anything like this: your subobject might overlap with your object and its members. As a result, the compiler has to generate extra, redundant load/store operations, because it must assume that some values may have changed.
Here is an example for the last issue (full code is here):
#include <memory>

struct IntValue {
    int x;
    explicit IntValue(int x) : x(x) {}
};

class MyClass_Ptr {
    std::unique_ptr<IntValue> a, b, c;
public:
    void Compute() {
        a->x += b->x + c->x;
        b->x += a->x + c->x;
        c->x += a->x + b->x;
    }
};
Clearly, it is stupid to store the subobjects a, b, c behind pointers. I've measured the time spent in one billion calls of the Compute method on a single object. Here are the results for different configurations:
2.3 sec: plain member (MinGW 5.1.0)
2.0 sec: plain member (MSVC 2013)
4.3 sec: unique_ptr (MinGW 5.1.0)
9.3 sec: unique_ptr (MSVC 2013)
When looking at the generated assembly for the innermost loop in each case, it is easy to understand why the times are so different:
;;; plain member (GCC)
lea edx, [rcx+rax] ; well-optimized code: only additions on registers
add r8d, edx ; all 6 additions present (no CSE optimization)
lea edx, [r8+rax] ; ('lea' instruction is also addition BTW)
add ecx, edx
lea edx, [r8+rcx]
add eax, edx
sub r9d, 1
jne .L3
;;; plain member (MSVC)
add ecx, r8d ; well-optimized code: only additions on registers
add edx, ecx ; 5 additions instead of 6 due to a common subexpression eliminated
add ecx, edx
add r8d, edx
add r8d, ecx
dec r9
jne SHORT $LL6#main
;;; unique_ptr (GCC)
add eax, DWORD PTR [rcx] ; slow code: a lot of memory accesses
add eax, DWORD PTR [rdx] ; each addition loads value from memory
mov DWORD PTR [rdx], eax ; each sum is stored to memory
add eax, DWORD PTR [r8] ; compiler is afraid that some values may be at same address
add eax, DWORD PTR [rcx]
mov DWORD PTR [rcx], eax
add eax, DWORD PTR [rdx]
add eax, DWORD PTR [r8]
sub r9d, 1
mov DWORD PTR [r8], eax
jne .L4
;;; unique_ptr (MSVC)
mov r9, QWORD PTR [rbx] ; awful code: 15 loads, 3 stores
mov rcx, QWORD PTR [rbx+8] ; compiler thinks that values may share
mov rdx, QWORD PTR [rbx+16] ; same address with pointers to values!
mov r8d, DWORD PTR [rcx]
add r8d, DWORD PTR [rdx]
add DWORD PTR [r9], r8d
mov r8, QWORD PTR [rbx+8]
mov rcx, QWORD PTR [rbx] ; load value of 'a' pointer from memory
mov rax, QWORD PTR [rbx+16]
mov edx, DWORD PTR [rcx] ; load value of 'a->x' from memory
add edx, DWORD PTR [rax] ; add the 'c->x' value
add DWORD PTR [r8], edx ; add sum 'a->x + c->x' to 'b->x'
mov r9, QWORD PTR [rbx+16]
mov rax, QWORD PTR [rbx] ; load value of 'a' pointer again =)
mov rdx, QWORD PTR [rbx+8]
mov r8d, DWORD PTR [rax]
add r8d, DWORD PTR [rdx]
add DWORD PTR [r9], r8d
dec rsi
jne SHORT $LL3#main
Related
I have tried to understand this based on a square function in C++ at godbolt.org. Clearly, the return value, parameters, and local variables all use "rbp - offset" addressing in this function.
Could someone please explain how this is possible?
What, then, would "rbp + offset" refer to in this case?
int square(int num){
int n = 5; // just to test how locals are treated with a frame pointer
return num * num;
}
Compiler (x86-64 gcc 11.1)
Generated Assembly:
square(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi ; both the parameter and the local use rbp-*
mov DWORD PTR [rbp-4], 5
mov eax, DWORD PTR [rbp-20]
imul eax, eax
pop rbp
ret
This is one of those cases where it’s handy to distinguish between parameters and arguments. In short: arguments are the values given by the caller, while parameters are the variables holding them.
When square is called, the caller places the argument in the rdi register, in accordance with the standard x86-64 calling convention. square then allocates a local variable, the parameter, and stores the argument into it. This allows the parameter to be used like any other variable: it can be read, written to, have its address taken, and so on. Since in this case it is the callee that allocated the memory for the parameter, it necessarily resides below the frame pointer.
With an ABI where arguments are passed on the stack, the callee would be able to reuse the stack slot containing the argument as the parameter. This is exactly what happens on x86-32 (pass -m32 to see for yourself):
square(int): # #square(int)
push ebp
mov ebp, esp
push eax
mov eax, dword ptr [ebp + 8]
mov dword ptr [ebp - 4], 5
mov eax, dword ptr [ebp + 8]
imul eax, dword ptr [ebp + 8]
add esp, 4
pop ebp
ret
Of course, if you enabled optimisations, the compiler would not bother with allocating a parameter on the stack in the callee; it would just use the value in the register directly:
square(int): # #square(int)
mov eax, edi
imul eax, edi
ret
GCC allows "leaf" functions (those that don't call other functions) to skip creating a stack frame. The free stack space is fair game for such functions to use as they wish.
Assuming the code is located inside an if block, what are the differences between creating an object on the free store and making only one call on it:
auto a = aFactory.createA();
int result = a->foo(5);
and making the call directly on the returned pointer?
int result = aFactory.createA()->foo(5);
Is there any difference in performance? Which way is better?
#include <iostream>
#include <memory>
class A
{
public:
int foo(int a){return a+3;}
};
class AFactory
{
public:
std::unique_ptr<A> createA(){return std::make_unique<A>();}
};
int main()
{
AFactory aFactory;
bool condition = true;
if(condition)
{
auto a = aFactory.createA();
int result = a->foo(5);
}
}
There is no difference in the code generated for the two versions, even with optimisations disabled.
Using gcc7.1 with -std=c++1z -O0
auto a = aFactory.createA();
int result = a->foo(5);
is compiled to:
lea rax, [rbp-24]
lea rdx, [rbp-9]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
and int result = aFactory.createA()->foo(5); to:
lea rax, [rbp-16]
lea rdx, [rbp-17]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
So they are pretty much identical.
This outcome is understandable when you realise that the only difference between the two versions is that in the first one we assign a name to our object, while in the second we work with an unnamed one. Other than that, both are created on the heap and used in the same way. And since a variable's name means nothing to the compiler (it is relevant only to readers of the code), it treats the two as if they were identical.
In your simple case it will not make a difference, because the (main) function ends right after creating and using a.
If more lines of code followed, the destruction of the a object would happen at the end of the if block in main, while in the one-line case it is destroyed at the end of that single statement. However, it would be bad design if the destructor of a more sophisticated class A made a difference there.
Because of compiler optimizations, performance questions should always be answered by profiling the concrete code.
I have a question about performance. I think this also applies to other languages (not only C++).
Imagine that I have this function:
int addNumber(int a, int b){
int result = a + b;
return result;
}
Is there any performance improvement if I write the code above like this?
int addNumber(int a, int b){
return a + b;
}
I ask because the second function doesn't declare a third variable. But would the compiler detect this in the first version?
To answer this question you can look at the generated assembler code. With -O2, x86-64 gcc 6.2 generates exactly the same code for both methods:
addNumber(int, int):
lea eax, [rdi+rsi]
ret
addNumber2(int, int):
lea eax, [rdi+rsi]
ret
Only with optimization turned off is there a difference:
addNumber(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov edx, DWORD PTR [rbp-20]
mov eax, DWORD PTR [rbp-24]
add eax, edx
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
addNumber2(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
However, a performance comparison without optimization is meaningless.
In principle there is no difference between the two approaches. The majority of compilers have handled this type of optimisation for some decades.
Additionally, if the function can be inlined (e.g. its definition is visible to the compiler when compiling code that uses such a function) the majority of compilers will eliminate the function altogether, and simply emit code to add the two variables passed and store the result as required by the caller.
Obviously, the comments above assume compiling with a relevant optimisation setting (e.g. not doing a debug build without optimisation).
Personally, I would not write such a function anyway. It is easier, in the caller, to write c = a + b instead of c = addNumber(a, b), so having a function like that offers no benefit to either programmer (effort to understand) or program (performance, etc). You might as well write comments that give no useful information.
c = a + b; // add a and b and store into c
Any self-respecting code reviewer would complain bitterly about uninformative functions or uninformative comments.
I'd only use such a function if its name conveyed some special meaning (i.e. more than just adding two values) for the application:
c = FunkyOperation(a,b);
int FunkyOperation(int a, int b)
{
/* Many useful ways of implementing this operation.
One of those ways happens to be addition, but we need to
go through 25 pages of obscure mathematical proof to
realise that
*/
return a + b;
}
(This question is specific to my machine's architecture and calling conventions, Windows x86_64)
I don't remember exactly where I read this, or whether I recall it correctly, but I heard that when a function returns a struct or object by value, it will either stuff the result into rax (if the object fits within the 64-bit register width), or be passed a pointer (in rcx) to where the resulting object should go, presumably allocated in the calling function's stack frame; it would then do all the usual initialization there, followed by a mov rax, rcx for the return trip. That is, something like
extern some_struct create_it(); // implemented in assembly
would really have a secret parameter like
extern some_struct create_it(some_struct* secret_param_pointing_to_where_i_will_be);
Did my memory serve me right, or am I incorrect? How are large objects (i.e. wider than the register width) returned by value from functions?
Here's a simple disassembly of code exemplifying what you're describing:
typedef struct
{
int b;
int c;
int d;
int e;
int f;
int g;
char x;
} A;
A foo(int b, int c)
{
A myA = {b, c, 5, 6, 7, 8, 10};
return myA;
}
int main()
{
A myA = foo(5,9);
return 0;
}
Here's the disassembly of the foo function and of the main function calling it:
main:
push ebp
mov ebp, esp
and esp, 0FFFFFFF0h
sub esp, 30h
call ___main
lea eax, [esp+20] ; placing the addr of myA in eax
mov dword ptr [esp+8], 9 ; param passing
mov dword ptr [esp+4], 5 ; param passing
mov [esp], eax ; passing myA addr as a param
call _foo
mov eax, 0
leave
retn
foo:
push ebp
mov ebp, esp
sub esp, 20h
mov eax, [ebp+12]
mov [ebp-28], eax
mov eax, [ebp+16]
mov [ebp-24], eax
mov dword ptr [ebp-20], 5
mov dword ptr [ebp-16], 6
mov dword ptr [ebp-12], 7
mov dword ptr [ebp-8], 9
mov byte ptr [ebp-4], 0Ah
mov eax, [ebp+8]
mov edx, [ebp-28]
mov [eax], edx
mov edx, [ebp-24]
mov [eax+4], edx
mov edx, [ebp-20]
mov [eax+8], edx
mov edx, [ebp-16]
mov [eax+0Ch], edx
mov edx, [ebp-12]
mov [eax+10h], edx
mov edx, [ebp-8]
mov [eax+14h], edx
mov edx, [ebp-4]
mov [eax+18h], edx
mov eax, [ebp+8]
leave
retn
Now let's go through what just happened. When calling foo, the parameters were passed in the following way: 9 was at the highest address, then 5, then the address where main's myA begins:
lea eax, [esp+20] ; placing the addr of myA in eax
mov dword ptr [esp+8], 9 ; param passing
mov dword ptr [esp+4], 5 ; param passing
mov [esp], eax ; passing myA addr as a param
Within foo there is a local myA stored on the stack frame. Since the stack grows downwards, the lowest address of myA begins at [ebp-28]; the -28 offset is likely caused by struct alignment, so I'm guessing the size of the struct is 28 bytes here rather than the expected 25. As we can see in foo, after its local myA is created and filled with parameters and immediate values, it is copied to the address of myA passed in from main (this is the actual meaning of return by value):
mov eax, [ebp+8]
mov edx, [ebp-28]
[ebp+8] is where the address of main::myA was stored (memory addresses go upwards here: ebp, plus the saved ebp (4 bytes), plus the return address (4 bytes), gives ebp+8 for the first byte of main::myA). As said earlier, foo::myA is stored at [ebp-28], since the stack grows downwards.
mov [eax], edx
Place foo::myA.b at the address of the first data member of main::myA, which is main::myA.b.
mov edx, [ebp-24]
mov [eax+4], edx
Place the value residing at the address of foo::myA.c in edx, then store that value at the address of main::myA.b + 4 bytes, which is main::myA.c.
As you can see, this process repeats itself throughout the function:
mov edx, [ebp-20]
mov [eax+8], edx
mov edx, [ebp-16]
mov [eax+0Ch], edx
mov edx, [ebp-12]
mov [eax+10h], edx
mov edx, [ebp-8]
mov [eax+14h], edx
mov edx, [ebp-4]
mov [eax+18h], edx
mov eax, [ebp+8]
This basically shows that when returning by value a struct that cannot be passed back in a register, the address where the return value should reside is passed as a hidden parameter to the function, and within the called function the members of the returned struct are copied to that address.
I hope this example helped you visualize what happens under the hood a little better :)
EDIT
I hope you've noticed that my example uses 32-bit assembly. I KNOW you asked about x86-64, but I'm currently unable to disassemble code on a 64-bit machine, so I hope you'll take my word that the concept is exactly the same for both 64-bit and 32-bit, and that the calling conventions are very similar.
That is exactly correct. The caller passes an extra argument, which is the address of the return value. Normally it will be in the caller's stack frame, but there are no guarantees.
The precise mechanics are specified by the platform ABI, but this mechanism is very common.
Various commentators have left useful links with documentation for calling conventions, so I'll hoist some of them into this answer:
Wikipedia article on x86 calling conventions
Agner Fog's collection of optimization resources, including a summary of calling conventions (Direct link to 57-page PDF document.)
Microsoft Developer Network (MSDN) documentation on calling conventions.
StackOverflow x86 tag wiki has lots of useful links.
I was curious to see what the cost is of accessing a data member through a pointer compared with not through a pointer, so came up with this test:
#include <iostream>
struct X{
int a;
};
int main(){
X* xheap = new X();
std::cin >> xheap->a;
volatile int x = xheap->a;
X xstack;
std::cin >> xstack.a;
volatile int y = xstack.a;
}
the generated x86 is:
int main(){
push rbx
sub rsp,20h
X* xheap = new X();
mov ecx,4
call qword ptr [__imp_operator new (013FCD3158h)]
mov rbx,rax
test rax,rax
je main+1Fh (013FCD125Fh)
xor eax,eax
mov dword ptr [rbx],eax
jmp main+21h (013FCD1261h)
xor ebx,ebx
std::cin >> xheap->a;
mov rcx,qword ptr [__imp_std::cin (013FCD3060h)]
mov rdx,rbx
call qword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::operator>> (013FCD3070h)]
volatile int x = xheap->a;
mov eax,dword ptr [rbx]
X xstack;
std::cin >> xstack.a;
mov rcx,qword ptr [__imp_std::cin (013FCD3060h)]
mov dword ptr [x],eax
lea rdx,[xstack]
call qword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::operator>> (013FCD3070h)]
volatile int y = xstack.a;
mov eax,dword ptr [xstack]
mov dword ptr [x],eax
It looks like the non-pointer access takes two instructions, compared to one instruction for the access through a pointer. Could somebody please tell me why this is, and which would take fewer CPU cycles?
I am trying to understand whether pointers incur more CPU instructions/cycles when accessing data members through them, as opposed to non-pointer access.
That's a terrible test.
The complete assignment to x is this:
mov eax,dword ptr [rbx]
mov dword ptr [x],eax
(the compiler is allowed to re-order the instructions somewhat, and has).
The assignment to y (which the compiler has given the same address as x) is
mov eax,dword ptr [xstack]
mov dword ptr [x],eax
which is almost the same (read memory pointed to by register, write to the stack).
The first one would be more complicated except that the compiler kept xheap in register rbx after the call to new, so it doesn't need to re-load it.
In either case I would be more worried about whether any of those accesses misses the L1 or L2 caches than about the precise instructions. (The processor doesn't even directly execute those instructions, they get converted internally to a different instruction set, and it may execute them in a different order.)
Accessing via a pointer instead of directly accessing from the stack costs you one extra indirection in the worst case (fetching the pointer). This is almost always irrelevant in itself; you need to look at your whole algorithm and how it works with the processor's caches and branch prediction logic.