Related
I have tried to understand this basing on a square function in c++ at godbolt.org . Clearly, return, parameters and local variables use “rbp - alignment” for this function.
Could someone please explain how this is possible?
What then would rbp + alignment do in this case?
int square(int num){
int n = 5;// just to test how locals are treated with frame pointer
return num * num;
}
Compiler (x86-64 gcc 11.1)
Generated Assembly:
square(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi. ;\\Both param and local var use rbp-*
mov DWORD PTR[rbp-4], 5. ;//
mov eax, DWORD PTR [rbp-20]
imul eax, eax
pop rbp
ret
This is one of those cases where it’s handy to distinguish between parameters and arguments. In short: arguments are the values given by the caller, while parameters are the variables holding them.
When square is called, the caller places the argument in the rdi register, in accordance with the standard x86-64 calling convention. square then allocates a local variable, the parameter, and places the argument in the parameter. This allows the parameter to be used like any other variable: be read, written into, having its address taken, and so on. Since in this case it’s the callee that allocated the memory for the parameter, it necessarily has to reside below the frame pointer.
With an ABI where arguments are passed on the stack, the callee would be able to reuse the stack slot containing the argument as the parameter. This is exactly what happens on x86-32 (pass -m32 to see yourself):
square(int): # #square(int)
push ebp
mov ebp, esp
push eax
mov eax, dword ptr [ebp + 8]
mov dword ptr [ebp - 4], 5
mov eax, dword ptr [ebp + 8]
imul eax, dword ptr [ebp + 8]
add esp, 4
pop ebp
ret
Of course, if you enabled optimisations, the compiler would not bother with allocating a parameter on the stack in the callee; it would just use the value in the register directly:
square(int): # #square(int)
mov eax, edi
imul eax, edi
ret
GCC allows "leaf" functions, those that don't call other functions, to not bother creating a stack frame. The free stack is fair game to do so as these fns wish.
Let's assume I pass a new object to a function like this:
loadContainer->addControlView( new BmpView( BMP_PICTURE ) );
Now, I want to change a specific characteristic of the BmpView before I pass it to addControlView. The way I do this is like this:
Control* newView = new BmpView( BMP_PICTURE );
newView->changeColor( WHITE );
loadContainer->addControlView( newView );
Does this create an extra temporary/local object? Or is there an equal amount of memory allocated in both cases?
The only added memory allocated in your function is a new pointer *newView, which its size is pretty low and doesn't affect by the actual size of the BmpView. It doesn't allocate twice memory for BmpView.
I'm not considering any memory overhead of calling changeColor, which I assume wasn't the point of this question.
In both cases, there is single call to new, hence the amount of memory used is equal. (Note, this is without speculating about any allocation possibly requested by BmpView constructor, changeColor(), etc.)
However, you may wish to refactor your code to ensure some exception safety, avoid potential leaks hence ensure the amount of memory used is under control:
// C++11
std::unique_ptr<Control> newView(new BmpView(BMP_PICTURE));
// C++14, preferred
//auto newView = std::make_unique<BmpView>(BMP_PICTURE);
newView->changeColor( WHITE );
loadContainer->addControlView( newView.release() );
Reference for the below code/assembly :
https://godbolt.org/g/8DgmC1
#include <cstdio>
class ValueClass {
public:
// Class content not important...
int someValue;
};
void PrintValueClass(ValueClass* ptr) {
printf("%d\n", ptr->someValue);
}
int main() {
PrintValueClass(new ValueClass());
ValueClass* pValueClass = new ValueClass();
pValueClass->someValue = 55;
PrintValueClass(pValueClass);
return 1;
}
Compiled Assembly (PrintValueClass redacted as not important to question at hand) :
Example where you pass the (new ValueClass) directly to the function.
main:
push rbp
mov rbp, rsp
mov edi, 4
call operator new(unsigned long)
mov DWORD PTR [rax], 0
mov rdi, rax
call PrintValueClass(ValueClass*)
mov eax, 1
pop rbp
ret
Example where you create a local variable holding the pointer, do something to it, and then pass it to the function.
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 4
call operator new(unsigned long)
mov DWORD PTR [rax], 0
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 55
mov rax, QWORD PTR [rbp-8]
mov rdi, rax
call PrintValueClass(ValueClass*)
mov eax, 1
leave
ret
Before diving into the assembly, if your question is does the new operation occur twice if you store the pointer in a variable first, the answer is no. As shown through the assembly, the new 'function' is only called once, so only sizeof(ValueClass) is ever being allocated here through some sort of heap allocation function. But I feel the need to answer the question fully, even if it was not exactly intended to ask this question. Is extra memory used? Technically yes, realistically no.
The only difference between these two pieces of code is stack 'allocation', noted by the sub rsp, 16, which essentially means 'allocate' 16 bytes on the stack for local variables. So truly the only difference here is 16 bytes, which will greatly change on what compiler you use, what architecture you target, and many more factors.
At the end of the day, I would go as far to say, you would never care about the extra 16 bytes.
I know this has been discussed a few times, but my situation is a bit different.
I have a third-party dll exporting some classes. Unfortunately, the header file is not available.
It is still possible to call exported functions. But I cannot get around passing the right 'this' pointer (which is passed in RCX register).
First I use dumpbin /exports to extract the function names (names are changed as the third-party library and function names are confidential).
4873 1308 0018B380 ?GetId#ThirdPartyClass#ThirdPartyNamespace##QEBAJXZ = ??GetId#ThirdPartyClass#ThirdPartyNamespace##QEBAJXZ (public: long __cdecl ThirdPartyNamespace::ThirdPartyClass::GetId(void)const )
Now, the API allows me to register my callback that receives a pointer to ThirdPartyNamespace::ThirdPartyClass (there is only forward declaration of ThirdPartyClass).
Here how I am trying to call ThirdPartyNamespace::ThirdPartyClass::GetId():
long (ThirdPartyNamespace::ThirdPartyClass::*_pFnGetId)() const;
HMODULE hModule = GetModuleHandle("ThirdPartyDLL.dll");
*(FARPROC*)&_pFnGetId= GetProcAddress(hModule, "?GetId#ThirdPartyClass#ThirdPartyNamespace##QEBAJXZ");
long id = (ptr->*_pFnGetId)();
Everything looks fine (i.e. if I step in - I get indeed inside ThirdPartyClass::GetId method. But the this pointer is not good. While the ptr is good and if in debugger I manually change rcx to the ptr - it works fine. But compiler does not pass ptr for some reason. Here is disassembly:
long id = (ptr->*_pFnGetId)();
000000005C882362 movsxd rax,dword ptr [rdi+30h]
000000005C882366 test eax,eax
000000005C882368 jne MyClass::MyCallback+223h (05C882373h)
000000005C88236A movsxd rcx,dword ptr [rdi+28h]
000000005C88236E add rcx,rsi
000000005C882371 jmp MyClass::MyCallback+240h (05C882390h)
000000005C882373 movsxd r8,dword ptr [rdi+2Ch]
000000005C882377 mov rcx,rax
000000005C88237A mov rax,qword ptr [r8+rsi]
000000005C88237E movsxd rdx,dword ptr [rax+rcx]
000000005C882382 movsxd rcx,dword ptr [rdi+28h]
000000005C882386 lea rax,[r8+rdx]
000000005C88238A add rcx,rax
000000005C88238D add rcx,rsi
000000005C882390 call qword ptr [rdi+20h]
000000005C882393 mov ebp,eax
Before executing these commands, rsi contains the pointer to the object of ThirdPartyClass (i.e. ptr), but instead of passing it in rcx directly, some arithmetic is performed on it and as a result, this pointer gets completely wrong.
some traces which I don't understand why compiler is doing it as it end up calling non-virtual function ThirdPartyClass::GetId():
000000005C88237A mov rax,qword ptr [r8+rsi]
R8 0000000000000000
RSI 000000004C691AA0 // good pointer to ThirdPartyClass object
RAX 0000000008E87728 // this gets pointer to virtual functions table of ThirdPartyClass
000000005C88237E movsxd rdx,dword ptr [rax+rcx]
RAX 0000000008E87728
RCX FFFFFFFFFFFFFFFF
RDX FFFFFFFFC0F3C600
000000005C882382 movsxd rcx,dword ptr [rdi+28h]
RCX 0000000000000000
RDI 000000005C9BE690
000000005C882386 lea rax,[r8+rdx]
RAX FFFFFFFFC0F3C600
RDX FFFFFFFFC0F3C600
R8 0000000000000000
000000005C88238A add rcx,rax
RAX FFFFFFFFC0F3C600
RCX FFFFFFFFC0F3C600
000000005C88238D add rcx,rsi
RCX 000000000D5CE0A0
RSI 000000004C691AA0
000000005C882390 call qword ptr [rdi+20h]
In my view, it should be as simple as
long id = (ptr->*_pFnGetId)();
mov rcx,rsi
call qword ptr [rdi+20h]
mov ebp,eax
And if I set rcx equal to rsi before the call qword ptr [rdi+20h] it returns me expected value.
Am I doing something completely wrong?
Thanks in advance.
Ok, I found a solution, by incident (as I already used similar approach and it worked in slightly different situation.
The solution is to trick the compiler by defining a fake class and calling member method by pointer, but pretending that it is a pointer to the known (to compiler) class.
Perhaps, it does not matter, but I know that ThirdPartyNamespace::ThirdPartyClass has virtual functions, so I declare fake class with virtual function as well.
class FakeCall
{
private:
FakeCall(){}
virtual ~FakeCall(){}
};
The rest as in the initial code except once small thing, instead of calling ptr->*_pFnGetId (where ptr is pointer to unknown, forward declared class ThirdPartyNamespace::ThirdPartyClass), I am pretending I am calling member method in my FakeCall class:
FakeCall * fake = (FakeCall*)ptr;
long sico = (fake->*_pFnGetId)();
Disassembly looks exactly as expected:
long sico = (fake->*_pFnGetSico)();
000000005A612096 mov rcx,rax
000000005A612099 call qword ptr [r12+20h]
000000005A61209E mov esi,eax
And it works perfectly!
Some observations:
The member method pointer, as I thought initially, nothing more than a normal function pointer.
Microsoft compiler (at least VS2008) goes crazy if calling member method for not defined class (i.e. only forward declaration of the name).
I have a function which takes 3 arguments, dest, src0, src1, each a pointer to data of size 12. I made two versions. One is written in C and optimized by the compiler, the other one is fully written in _asm. So yeah. 3 arguments? I naturally do something like:
mov ecx, [src0]
mov edx, [src1]
mov eax, [dest]
I am a bit confused by the compiler, as it saw fit to add the following:
_src0$ = -8 ; size = 4
_dest$ = -4 ; size = 4
_src1$ = 8 ; size = 4
?vm_vec_add_scalar_asm##YAXPAUvec3d##PBU1#1#Z PROC ; vm_vec_add_scalar_asm
; _dest$ = ecx
; _src0$ = edx
; 20 : {
sub esp, 8
mov DWORD PTR _src0$[esp+8], edx
mov DWORD PTR _dest$[esp+8], ecx
; 21 : _asm
; 22 : {
; 23 : mov ecx, [src0]
mov ecx, DWORD PTR _src0$[esp+8]
; 24 : mov edx, [src1]
mov edx, DWORD PTR _src1$[esp+4]
; 25 : mov eax, [dest]
mov eax, DWORD PTR _dest$[esp+8]
Function body etc.
add esp, 8
ret 0
What does the _src0$[esp+8] etc. even means? Why does it do all this stuff before my code? Why does it try to [apparently]stack anything so badly?
In comparison, the C++ version has only the following before its body, which is pretty similar:
_src1$ = 8 ; size = 4
?vm_vec_add##YAXPAUvec3d##PBU1#1#Z PROC ; vm_vec_add
; _dest$ = ecx
; _src0$ = edx
mov eax, DWORD PTR _src1$[esp-4]
Why is this little sufficient?
The answer of Mats Petersson explained __fastcall. But I guess that is not exactly what you're asking ...
Actually _src0$[esp+8] just means [_src0$ + esp + 8], and _src0$ is defined above:
_src0$ = -8 ; size = 4
So, the whole expression _src0$[esp+8] is nothing but [esp] ...
To see why it does all these stuff, you should probably first understand what Mats Petersson said in his post, the __fastcall, or more generally, what is a calling convention. See the link in his post for detailed informations.
Assuming that you have understood __fastcall, now let's see what happens to your codes. The compiler is using __fastcall. Your callee function is f(dst, src0, src1), which requires 3 parameters, so according to the calling convention, when a caller calls f, it does the following:
Move dst to ecx and src0 to edx
Push src1 onto the stack
Push the 4 bytes return address onto the stack
Go to the starting address of the function f
And the callee f, when its code begins, then knows where the parameters are: dst and src0 are in the registers ecx and edx, respectively; esp is pointing to the 4 bytes return address, but the 4 bytes below it (i.e. DWORD PTR[esp+4]) is exactly src1.
So, in your "C++ version", the function f just does what it should do:
mov eax, DWORD PTR _src1$[esp-4]
Here _src1$ = 8, so _src1$[esp-4] is exactly [esp+4]. See, it just retrieves the parameter src1 and stores it in eax.
There is however a tricky point here. In the code of f, if you want to use the parameter src1 multiple times, you can certainly do that, because it's always stored in the stack, right below the return address; but what if you want to use dst and src0 multiple times? They are in the registers, and can be destroyed at any time.
So in that case, the compiler should do the following: right after entering the function f, it should remember the current values of ecx and edx (by pushing them onto the stack). These 8 bytes are the so-called "shadow space". It is not done in your "C++ version", probably because the compiler knows for sure that these two parameters will not be used multiple times, or that it can handle it properly some other way.
Now, what happens to your _asm version? The problem here is that you are using inline assembly. The compiler then loses its control to the registers, and it cannot assume that the registers ecx and edx are safe in your _asm block (they are actually not, since you used them in the _asm block). Thus it is forced to save them at the beginning of the function.
The saving goes as follows: it first raises esp by 8 bytes (sub esp, 8), then move edx and ecx to [esp] and [esp+4] respectively.
And then it can enter safely your _asm block. Now in its mind (if it has one), the picture is that [esp] is src0, [esp+4] is dst, [esp+8] is the 4 byte return address, and [esp+12] is src1. It no longer thinks about ecx and edx.
Thus your first instruction in the _asm block, mov ecx, [src0], should be interpreted as mov ecx, [esp], which is the same as
mov ecx, DWORD PTR _src0$[esp+8]
and the same for the other two instructions.
At this point, you might say, aha it's doing stupid things, I don't want it to waste time and space on that, is there a way?
Well there is a way - do not use inline assembly... it's convenient, but there is a compromise.
You can write the assembly function f in a .asm source file and public it. In the C/C++ code, declare it as extern 'C' f(...). Then, when you begin your assembly function f, you can play directly with your ecx and edx.
The compiler has decided to use a calling convention that uses "pass arguments in registers" aka __fastcall. This allows the compiler to pass some of the arguments in registers, instead of pushing onto stack, and this can reduce the overhead in the call, because moving from a variable to a register is faster than pushing onto the stack, and it's now already in a register when we get to the callee function, so no need to read it from the stack.
There is a lot more information about how calling conventions work on the web. The wikipedia article on x86 calling conventions is a good starting point.
There is a class SomeClass which holds some data and methods that operates on this data. And it must be created with some arguments like:
SomeClass(int some_val, float another_val);
There is another class, say Manager, which includes SomeClass, and heavily uses its methods.
So, what would be better in terms of performance (data locality, cache hits, etc.), declare object of SomeClass as member of Manager and use member initialization in Manager's constructor or declare object of SomeClass as unique_ptr?
class Manager
{
public:
Manager() : some(5, 3.0f) {}
private:
SomeClass some;
};
or
class Manager
{
public:
Manager();
private:
std::unique_ptr<SomeClass> some;
}
Short answer
Most likely, there is no difference in runtime efficiency of accessing your subobject. But using pointer can be slower for several reasons (see details below).
Moreover, there are several other things you should remember:
When using pointer, you usually have to allocate/deallocate memory for subobject separately, which takes some time (quite a lot if you do it much).
When using pointer, you can cheaply move your subobject without copying.
Speaking of compile times, pointer is better than plain member. With plain member, you cannot remove dependency of Manager declaration on SomeClass declaration. With pointers, you can do it with forward declaration. Less dependencies may result is less build times.
Details
I'd like to provide more details about performance of subobject accesses. I think that using pointer can be slower than using plain member for several reasons:
Data locality (and cache performance) is likely to be better with plain member. You usually access data of Manager and SomeClass together, and plain member is guaranteed to be near other data, while heap allocations may place object and subobject far from each other.
Using pointer means one more level of indirection. To get address of a plain member, you can simply add a compile-time constant offset fo object address (which is often merged with other assembly instruction). When using pointer, you have to additionally read a word from the member pointer to get actual pointer to subobject. See Q1 and Q2 for more details.
Aliasing is perhaps the most important issue. If you are using plain member, then compiler can assume that: your subobject lies fully within your object in memory, and it does not overlap with other members of your object. When using pointer, compiler often cannot assume anything like this: you subobject may overlap with your object and its members. As a result, compiler has to generate more useless load/store operations, because it thinks that some values may change.
Here is an example for the last issue (full code is here):
struct IntValue {
int x;
IntValue(int x) : x(x) {}
};
class MyClass_Ptr {
unique_ptr<IntValue> a, b, c;
public:
void Compute() {
a->x += b->x + c->x;
b->x += a->x + c->x;
c->x += a->x + b->x;
}
};
Clearly, it is stupid to store subobjects a, b, c by pointers. I've measured time spent in one billion calls of Compute method for a single object. Here are results with different configurations:
2.3 sec: plain member (MinGW 5.1.0)
2.0 sec: plain member (MSVC 2013)
4.3 sec: unique_ptr (MinGW 5.1.0)
9.3 sec: unique_ptr (MSVC 2013)
When looking at the generated assembly for innermost loop in each case, it is easy to understand why the times are so different:
;;; plain member (GCC)
lea edx, [rcx+rax] ; well-optimized code: only additions on registers
add r8d, edx ; all 6 additions present (no CSE optimization)
lea edx, [r8+rax] ; ('lea' instruction is also addition BTW)
add ecx, edx
lea edx, [r8+rcx]
add eax, edx
sub r9d, 1
jne .L3
;;; plain member (MSVC)
add ecx, r8d ; well-optimized code: only additions on registers
add edx, ecx ; 5 additions instead of 6 due to a common subexpression eliminated
add ecx, edx
add r8d, edx
add r8d, ecx
dec r9
jne SHORT $LL6#main
;;; unique_ptr (GCC)
add eax, DWORD PTR [rcx] ; slow code: a lot of memory accesses
add eax, DWORD PTR [rdx] ; each addition loads value from memory
mov DWORD PTR [rdx], eax ; each sum is stored to memory
add eax, DWORD PTR [r8] ; compiler is afraid that some values may be at same address
add eax, DWORD PTR [rcx]
mov DWORD PTR [rcx], eax
add eax, DWORD PTR [rdx]
add eax, DWORD PTR [r8]
sub r9d, 1
mov DWORD PTR [r8], eax
jne .L4
;;; unique_ptr (MSVC)
mov r9, QWORD PTR [rbx] ; awful code: 15 loads, 3 stores
mov rcx, QWORD PTR [rbx+8] ; compiler thinks that values may share
mov rdx, QWORD PTR [rbx+16] ; same address with pointers to values!
mov r8d, DWORD PTR [rcx]
add r8d, DWORD PTR [rdx]
add DWORD PTR [r9], r8d
mov r8, QWORD PTR [rbx+8]
mov rcx, QWORD PTR [rbx] ; load value of 'a' pointer from memory
mov rax, QWORD PTR [rbx+16]
mov edx, DWORD PTR [rcx] ; load value of 'a->x' from memory
add edx, DWORD PTR [rax] ; add the 'c->x' value
add DWORD PTR [r8], edx ; add sum 'a->x + c->x' to 'b->x'
mov r9, QWORD PTR [rbx+16]
mov rax, QWORD PTR [rbx] ; load value of 'a' pointer again =)
mov rdx, QWORD PTR [rbx+8]
mov r8d, DWORD PTR [rax]
add r8d, DWORD PTR [rdx]
add DWORD PTR [r9], r8d
dec rsi
jne SHORT $LL3#main