What do _Znwm and _ZdlPv mean in assembly? - C++

I'm new to assembly and I'm trying to figure out how C++ handles dynamic dispatch in assembly.
While looking through the assembly code, I saw two unusual calls:
call _Znwm
call _ZdlPv
These calls did not have a subroutine I could trace them to. From examining the code, _Znwm seemed to return the address of the object whose constructor was about to be called, but I'm not sure about that. _ZdlPv was in a block of code that could never be reached (it was jumped over).
C++:
Fruit * f;
f = new Apple();
x86:
# BB#1:
mov eax, 8
mov edi, eax
call _Znwm
mov rdi, rax
mov rcx, rax
.Ltmp6:
mov qword ptr [rbp - 48], rdi # 8-byte Spill
mov rdi, rax
mov qword ptr [rbp - 56], rcx # 8-byte Spill
call _ZN5AppleC2Ev
Any advice would be appreciated.
Thanks.

_Znwm is the mangled name of operator new(unsigned long): in the Itanium C++ ABI name mangling, _Z introduces a mangled name, nw encodes operator new, and m encodes its unsigned long size parameter.
_ZdlPv is operator delete(void*): dl encodes operator delete and Pv encodes its single void* parameter. You can decode names like these with a demangler such as c++filt.
So new Apple() first calls operator new to obtain raw storage (which is why _Znwm appeared to return the object's address) and then runs the constructor (_ZN5AppleC2Ev) on that storage. The "unreachable" _ZdlPv block is the cleanup path that releases the storage if the constructor throws; on the normal path it is jumped over.
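If you want to watch these calls happen at run time, you can replace the global allocation functions. Here is a minimal sketch (my own illustration; the Apple type below is a stand-in for the real class hierarchy in the question):
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

// Replacing the global allocation functions makes the _Znwm/_ZdlPv
// calls observable at run time.
void* operator new(std::size_t n) {
    std::printf("operator new(%zu)\n", n);    // this is the _Znwm call
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept {
    std::printf("operator delete(%p)\n", p);  // this is the _ZdlPv call
    std::free(p);
}

struct Apple { long x = 0; };  // 8 bytes, matching `mov eax, 8` above

int main() {
    Apple* f = new Apple();  // prints "operator new(8)"
    delete f;                // prints "operator delete(...)"
}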

Related

Will deleting the pointer to pointer cause a memory leak?

I have a question about C-style string symbolic constants and dynamically allocating arrays.
const char** name = new const char* { "Alan" };
delete name;
When I try to delete name after new'ing the memory, the compiler suggests using delete instead of delete[]. I understand that name only stores the address of the pointer to the read-only string.
However, if I only delete the pointer to pointer (which is name), will the string itself be leaked?
As the comments above indicate, you don't need to manage the memory that "Alan" exists in.
Let's see what that looks like in practice.
I made a modified version of your code:
#include <iostream>
void test() {
    const char** name;
    name = new const char* { "Alan\n" };
    delete name;
}
int main()
{
    test();
}
and then I popped it into Godbolt, which shows what's happening under the hood (excerpts copied below).
In both clang and gcc, the memory that stores "Alan\n" is static, so it exists for the program's entire lifetime. That is why there is no memory leak even though you never touch the string again after mentioning it. The value of the pointer to "Alan\n" is just a position in the program's image: the offset .L.str (clang) or OFFSET FLAT:.LC0 (gcc).
clang:
test(): # #test()
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 8
call operator new(unsigned long)
mov rcx, rax
movabs rdx, offset .L.str
mov qword ptr [rax], rdx
mov qword ptr [rbp - 8], rcx
mov rax, qword ptr [rbp - 8]
cmp rax, 0
mov qword ptr [rbp - 16], rax # 8-byte Spill
je .LBB1_2
mov rax, qword ptr [rbp - 16] # 8-byte Reload
mov rdi, rax
call operator delete(void*)
.L.str:
.asciz "Alan\n"
gcc:
.LC0:
.string "Alan\n"
test():
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 8
call operator new(unsigned long)
mov QWORD PTR [rax], OFFSET FLAT:.LC0
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
test rax, rax
je .L3
mov esi, 8
mov rdi, rax
call operator delete(void*, unsigned long)
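To make the ownership explicit, here is a small sketch of my own (not code from the question) contrasting what does and does not need to be freed:
#include <cstring>

void ownership_demo() {
    // Heap-allocated: one pointer object. delete matches new, no leak.
    const char** name = new const char* { "Alan\n" };
    delete name;               // frees only the pointer object

    // The literal "Alan\n" lives in static storage for the program's
    // whole lifetime; it was never new'ed, so it must never be deleted.

    // Only if the characters themselves were heap-allocated would the
    // array form be required:
    char* copy = new char[6];  // 5 chars + NUL terminator
    std::strcpy(copy, "Alan\n");
    delete[] copy;             // delete[] matches new[]
}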

Why does using the ternary operator to return a string generate considerably different code from returning in an equivalent if/else block?

I was playing with the Compiler Explorer and I stumbled upon an interesting behavior with the ternary operator when using something like this:
std::string get_string(bool b)
{
    return b ? "Hello" : "Stack-Overflow";
}
The compiler generated code for this (clang trunk, with -O3) is this:
get_string[abi:cxx11](bool): # #get_string[abi:cxx11](bool)
push r15
push r14
push rbx
mov rbx, rdi
mov ecx, offset .L.str
mov eax, offset .L.str.1
test esi, esi
cmovne rax, rcx
add rdi, 16 #< Why is the compiler storing the length of the string
mov qword ptr [rbx], rdi
xor sil, 1
movzx ecx, sil
lea r15, [rcx + 8*rcx]
lea r14, [rcx + 8*rcx]
add r14, 5 #< I also think this is the length of "Hello" (but not sure)
mov rsi, rax
mov rdx, r14
call memcpy #< Why is there a call to memcpy
mov qword ptr [rbx + 8], r14
mov byte ptr [rbx + r15 + 21], 0
mov rax, rbx
pop rbx
pop r14
pop r15
ret
.L.str:
.asciz "Hello"
.L.str.1:
.asciz "Stack-Overflow"
However, the compiler-generated code for the following snippet is considerably smaller, contains no call to memcpy, and does not need to know the lengths of both strings at the same time. There are two different labels that it jumps to:
std::string better_string(bool b)
{
    if (b)
    {
        return "Hello";
    }
    else
    {
        return "Stack-Overflow";
    }
}
The compiler generated code for the above snippet (clang trunk with -O3) is this:
better_string[abi:cxx11](bool): # #better_string[abi:cxx11](bool)
mov rax, rdi
lea rcx, [rdi + 16]
mov qword ptr [rdi], rcx
test sil, sil
je .LBB0_2
mov dword ptr [rcx], 1819043144
mov word ptr [rcx + 4], 111
mov ecx, 5
mov qword ptr [rax + 8], rcx
ret
.LBB0_2:
movabs rdx, 8606216600190023247
mov qword ptr [rcx + 6], rdx
movabs rdx, 8525082558887720019
mov qword ptr [rcx], rdx
mov byte ptr [rax + 30], 0
mov ecx, 14
mov qword ptr [rax + 8], rcx
ret
The result is the same when I use the ternary operator with:
std::string get_string(bool b)
{
    return b ? std::string("Hello") : std::string("Stack-Overflow");
}
I would like to know why the ternary operator in the first example generates that compiler code. I believe that the culprit lies within the const char[].
P.S.: GCC emits calls to strlen in the first example, but Clang doesn't.
Link to the Compiler Explorer example: https://godbolt.org/z/Exqs6G
Thank you for your time!
Sorry for the wall of code!
The overarching difference here is that the first version is branchless.
16 isn’t the length of any string here (the longer one, including its NUL terminator, is only 15 bytes); it’s an offset into the return object (whose address is passed in RDI to support RVO), used to indicate that the small-string optimization is in use (note the lack of allocation). The length, 5 or 14, is computed branchlessly into R14 (the two LEAs compute 9 * (!b), and the ADD adds the base length 5); it is stored into the std::string as well as passed to memcpy (along with a pointer chosen by CMOVNE) to copy the actual string bytes.
The other version has an obvious branch (although part of the std::string construction has been hoisted above it) and actually does have 5 and 14 explicitly, but is obfuscated by the fact that the string bytes have been included as immediate values (expressed as integers) of various sizes.
As for why these three equivalent functions produce two different versions of the generated code, all I can offer is that optimizers are iterative and heuristic algorithms; they don’t reliably find the same “best” assembly independently of their starting point.
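For reference, the object layout this answer is describing looks roughly like the following sketch (a simplification of libstdc++'s std::string; the member names are illustrative, not the real ones):
#include <cstddef>

struct StringSketch {
    char*       data;        // points at `buffer` below while SSO is active
    std::size_t size;        // the length stored at [rbx + 8] in the asm
    char        buffer[16];  // offset 16: in-object small-string storage
};
// "Hello" (5 + NUL) and "Stack-Overflow" (14 + NUL) both fit in the
// 16-byte buffer, so `data` is set to the return slot's address plus 16,
// which is the `add rdi, 16` the question asks about.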
The first version returns a string object which is initialized with a non-constant expression yielding one of the string literals, so the constructor runs as it would for any other runtime string value; hence the memcpy to do the initialization.
The other variants return either one string object initialized with a string literal or another string object initialized with another string literal, both of which can be optimized to a string object constructed from a constant expression where no memcpy is needed.
So the real answer is: the first version applies the ?: operator to char[] expressions before initializing the object, while the other versions apply it to string objects that are already being initialized.
It does not matter whether one of the versions is branchless.
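One way to see the distinction in source form is this hand-desugared sketch (my restatement of the answer, not code from the question):
#include <string>

// Ternary on char arrays: a pointer is selected first, so a single
// constructor call runs on a value the compiler cannot see through,
// and the generic copy path (runtime length, memcpy) is used.
std::string get_string_desugared(bool b) {
    const char* p = b ? "Hello" : "Stack-Overflow";  // select pointer
    return std::string(p);                           // one construction
}

// if/else: two separate constructions, each from a literal the compiler
// fully knows, so each can be lowered to immediate stores of the bytes.
std::string better_string_desugared(bool b) {
    if (b) return std::string("Hello");
    return std::string("Stack-Overflow");
}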

C++ std::string initialization better performance (assembly)

I was playing with www.godbolt.org to check which code generates better assembly, and I can't understand why these two different approaches generate different results (in assembly instructions).
The first approach is to declare a string, and then later set a value:
#include <string>
int foo() {
    std::string a;
    a = "abcdef";
    return a.size();
}
Which, in my gcc 7.4 (-O3) outputs:
.LC0:
.string "abcdef"
foo():
push rbp
mov r8d, 6
mov ecx, OFFSET FLAT:.LC0
xor edx, edx
push rbx
xor esi, esi
sub rsp, 40
lea rbx, [rsp+16]
mov rdi, rsp
mov BYTE PTR [rsp+16], 0
mov QWORD PTR [rsp], rbx
mov QWORD PTR [rsp+8], 0
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long)
mov rdi, QWORD PTR [rsp]
mov rbp, QWORD PTR [rsp+8]
cmp rdi, rbx
je .L1
call operator delete(void*)
.L1:
add rsp, 40
mov eax, ebp
pop rbx
pop rbp
ret
mov rbp, rax
jmp .L3
foo() [clone .cold]:
.L3:
mov rdi, QWORD PTR [rsp]
cmp rdi, rbx
je .L4
call operator delete(void*)
.L4:
mov rdi, rbp
call _Unwind_Resume
So, I imagined that if I initialize the string in the declaration, the output assembly would be shorter:
int bar() {
    std::string a {"abcdef"};
    return a.size();
}
And indeed it is:
bar():
mov eax, 6
ret
Why the huge difference? What prevents gcc from optimizing the first version to be similar to the second?
godbolt link
This is just a guess:
operator= has a strong exception guarantee, which means:
"If an exception is thrown for any reason, this function has no effect (strong exception guarantee)." (since C++11)
(source)
So while the constructor can leave the object in any condition it likes, operator= needs to make sure that, if anything goes wrong, the object is left exactly as it was before; I suspect that's why the call to operator delete is there (to clean up any memory allocated along the way).
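As a sketch of what that guarantee promises (my own illustration, not code from the question):
#include <cassert>
#include <new>
#include <string>

// If assignment throws (e.g. std::bad_alloc while reallocating), the
// target must be left exactly as it was; that bookkeeping is part of
// what the extra cleanup code in foo() pays for.
void strong_guarantee_demo(std::string& a, const char* src) {
    const std::string before = a;
    try {
        a = src;              // may allocate, may throw
    } catch (const std::bad_alloc&) {
        assert(a == before);  // strong guarantee: `a` is unchanged
    }
}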

Should I create an object on the free store just for one call?

Assuming the code is located inside an if block, what are the differences between creating the object on the free store and making only one call on it:
auto a = aFactory.createA();
int result = a->foo(5);
and making the call directly on the returned pointer?
int result = aFactory.createA()->foo(5);
Is there any difference in performance? Which way is better?
#include <iostream>
#include <memory>

class A
{
public:
    int foo(int a) { return a + 3; }
};

class AFactory
{
public:
    std::unique_ptr<A> createA() { return std::make_unique<A>(); }
};

int main()
{
    AFactory aFactory;
    bool condition = true;
    if (condition)
    {
        auto a = aFactory.createA();
        int result = a->foo(5);
    }
}
Look here. There is no difference in the code generated for the two versions, even with optimizations disabled.
Using gcc 7.1 with -std=c++1z -O0,
auto a = aFactory.createA();
int result = a->foo(5);
is compiled to:
lea rax, [rbp-24]
lea rdx, [rbp-9]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
and int result = aFactory.createA()->foo(5); to:
lea rax, [rbp-16]
lea rdx, [rbp-17]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
So they are pretty much identical.
This outcome is understandable when you realize that the only difference between the two versions is that in the first one we assign a name to our object, while in the second we work with an unnamed one. Other than that, both are created on the heap and used the same way. And since a variable name means nothing to the compiler (it is only relevant to readers of the code), it treats the two as if they were identical.
In your simple case it will not make a difference, because the (main) function ends right after creating and using a.
If more lines of code followed, the destruction of the a object would happen at the end of the if block in main, while in the one-line case the temporary is destroyed at the end of that single full expression. However, it would be bad design if the destructor of a more sophisticated class A made an observable difference there.
Because of compiler optimizations, performance questions should always be answered by profiling the concrete code.
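The lifetime difference described above is easy to observe with a destructor that prints; here is a sketch using a hypothetical Noisy type (added for illustration, not part of the question):
#include <cstdio>
#include <memory>

struct Noisy {
    int foo(int a) { return a + 3; }
    ~Noisy() { std::puts("~Noisy"); }
};

int main() {
    {
        auto a = std::make_unique<Noisy>();
        int r1 = a->foo(5);
        std::printf("named: %d\n", r1);  // prints before ~Noisy
    }                                    // `a` destroyed at block end
    int r2 = std::make_unique<Noisy>()->foo(5);
    // The temporary was destroyed at the end of the full expression,
    // so ~Noisy has already printed by the time this line runs:
    std::printf("temporary: %d\n", r2);
}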

Why do gcc and clang produce very different code for member function template parameters?

I am trying to understand what is going on when a member function pointer is used as a template parameter. I always thought that function pointers (and member function pointers) are a run-time concept, so I was wondering what happens when they are used as template parameters. For this reason I took a look at the output produced by this code:
struct Foo { void foo(int i){ } };

template <typename T, void (T::*F)(int)>
void callFunc(T& t) { (t.*F)(1); }

void callF(Foo& f) { f.foo(1); }

int main(){
    Foo f;
    callF(f);
    callFunc<Foo, &Foo::foo>(f);
}
where callF is for comparison. gcc 6.2 produces the exact same output for both functions:
callF(Foo&): // void callFunc<Foo, &Foo::foo>(Foo&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov esi, 1
mov rdi, rax
call Foo::foo(int)
nop
leave
ret
while clang 3.9 produces almost the same output for callF():
callF(Foo&): # #callF(Foo&)
push rbp
mov rbp, rsp
sub rsp, 16
mov esi, 1
mov qword ptr [rbp - 8], rdi
mov rdi, qword ptr [rbp - 8]
call Foo::foo(int)
add rsp, 16
pop rbp
ret
but very different output for the template instantiation:
void callFunc<Foo, &Foo::foo>(Foo&): # #void callFunc<Foo, &Foo::foo>(Foo&)
push rbp
mov rbp, rsp
sub rsp, 32
xor eax, eax
mov cl, al
mov qword ptr [rbp - 8], rdi
mov rdi, qword ptr [rbp - 8]
test cl, 1
mov qword ptr [rbp - 16], rdi # 8-byte Spill
jne .LBB3_1
jmp .LBB3_2
.LBB3_1:
movabs rax, Foo::foo(int)
sub rax, 1
mov rcx, qword ptr [rbp - 16] # 8-byte Reload
mov rdx, qword ptr [rcx]
mov rax, qword ptr [rdx + rax]
mov qword ptr [rbp - 24], rax # 8-byte Spill
jmp .LBB3_3
.LBB3_2:
movabs rax, Foo::foo(int)
mov qword ptr [rbp - 24], rax # 8-byte Spill
jmp .LBB3_3
.LBB3_3:
mov rax, qword ptr [rbp - 24] # 8-byte Reload
mov esi, 1
mov rdi, qword ptr [rbp - 16] # 8-byte Reload
call rax
add rsp, 32
pop rbp
ret
Why is that? Is gcc taking some (possibly non-standard) shortcut?
gcc was able to figure out what the template was doing and generated the simplest code possible; clang didn't. A compiler is permitted to perform any optimization as long as the observable results comply with the C++ specification. If that means optimizing away an intermediate member function pointer, so be it: nothing else in the code references it, so it can be removed completely and the whole thing replaced with a simple function call.
gcc and clang are different compilers, written by different people, with different approaches and algorithms for compiling C++.
It is natural and expected to see different results from different compilers. In this case, gcc was able to figure things out better than clang; I'm sure there are other situations where clang will figure things out better than gcc.
This test was done without any optimizations requested.
One compiler generated more verbose unoptimized code.
Unoptimized code is, quite simply, uninteresting. It is intended to be correct, easy to debug, and derived directly from an intermediate representation that is easy to optimize.
The details of optimized code are what matter, barring a ridiculous and widespread slowdown that makes debugging painful.
There is nothing of interest to see or explain here.
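For the curious: the branch clang left in the unoptimized instantiation is the generic member-pointer call sequence. Under the Itanium C++ ABI, a pointer-to-member-function must distinguish virtual from non-virtual targets, roughly as in this hand-written sketch (the struct and all names are illustrative, not a real API):
#include <cstddef>
#include <cstdint>

// Model of an Itanium-ABI pointer-to-member-function: `ptr` holds either
// the code address (even) or 1 + a byte offset into the vtable (odd).
// clang's `test cl, 1` / `sub rax, 1` / vtable-load sequence implements
// exactly this; here the condition is a known constant, which gcc folded
// away even at -O0 and clang did not.
struct MemFnRep {
    std::uintptr_t ptr;
    std::ptrdiff_t adjustment;  // this-pointer adjustment (0 here)
};

using Fn = int (*)(void*, int);

int call_through(void* obj, MemFnRep f) {
    Fn target;
    if (f.ptr & 1) {
        // Virtual: fetch the entry at byte offset ptr - 1 in the vtable.
        char* vtable = *static_cast<char**>(obj);
        target = *reinterpret_cast<Fn*>(vtable + (f.ptr - 1));
    } else {
        // Non-virtual: ptr is the function's address itself.
        target = reinterpret_cast<Fn>(f.ptr);
    }
    return target(obj, 1);
}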