Why does unique_ptr instantiation compile to larger binary than raw pointer?

Why does unique_ptr instantiation compile to larger binary than raw pointer? - c++

I was always under the impression that a std::unique_ptr had no overhead compared to using a raw pointer. However, compiling the following code
#include <memory>
void raw_pointer() {
int* p = new int[100];
delete[] p;
}
void smart_pointer() {
auto p = std::make_unique<int[]>(100);
}
with g++ -std=c++14 -O3 produces the following assembly:
raw_pointer():
sub rsp, 8
mov edi, 400
call operator new[](unsigned long)
add rsp, 8
mov rdi, rax
jmp operator delete[](void*)
smart_pointer():
sub rsp, 8
mov edi, 400
call operator new[](unsigned long)
lea rdi, [rax+8]
mov rcx, rax
mov QWORD PTR [rax], 0
mov QWORD PTR [rax+392], 0
mov rdx, rax
xor eax, eax
and rdi, -8
sub rcx, rdi
add ecx, 400
shr ecx, 3
rep stosq
mov rdi, rdx
add rsp, 8
jmp operator delete[](void*)
Why is the output for smart_pointer() almost three times as large as raw_pointer()?

Because std::make_unique<int[]>(100) performs value initialization while new int[100] performs default initialization - In the first case, elements are 0-initialized (for int), while in the second case elements are left uninitialized. Try:
int *p = new int[100]();
And you'll get the same output as with the std::unique_ptr.
See this for instance, which states that std::make_unique<int[]>(100) is equivalent to:
std::unique_ptr<T>(new int[100]())
If you want a non-initialized array with std::unique_ptr, you could use1:
std::unique_ptr<int[]>(new int[100]);
1 As mentioned by #Ruslan in the comments, be aware of the difference between std::make_unique() and std::unique_ptr() - See Differences between std::make_unique and std::unique_ptr.

Related

Will delete the pointer to pointer cause the memory leak?

I have a question about C-style string symbolic constants and dynamically allocating arrays.
const char** name = new const char* { "Alan" };
delete name;
when I try to delete name after new'ing a piece of memory, the compiler suggest to me to use delete instead of delete[]. I understand name only stores the address of the pointer to the only-read string.
However, if I only delete the pointer to pointer (which is name), will the string itself cause a memory leak?

As the comments above indicate, you don't need to manage the memory that "Alan" exists in.
Let's see what that looks like in practice.
I made a modified version of your code:
#include <iostream>
void test() {
const char** name;
name = new const char* { "Alan\n" };
delete name;
}
int main()
{
test();
}
and then I popped it into godbolt and it shows what's happening under the hood. (excerpts copied below)
In both clang and gcc, the memory that stores "Alan\n" is in static memory so it always exists. This is how it creates no memory leak even though you never touch it again after mentioning it. The value of the pointer to "Alan\n" is just the position in the program's memory, offset .L.str or OFFSET FLAT:.LC0.
clang:
test(): # #test()
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 8
call operator new(unsigned long)
mov rcx, rax
movabs rdx, offset .L.str
mov qword ptr [rax], rdx
mov qword ptr [rbp - 8], rcx
mov rax, qword ptr [rbp - 8]
cmp rax, 0
mov qword ptr [rbp - 16], rax # 8-byte Spill
je .LBB1_2
mov rax, qword ptr [rbp - 16] # 8-byte Reload
mov rdi, rax
call operator delete(void*)
.L.str:
.asciz "Alan\n"
gcc:
.LC0:
.string "Alan\n"
test():
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 8
call operator new(unsigned long)
mov QWORD PTR [rax], OFFSET FLAT:.LC0
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
test rax, rax
je .L3
mov esi, 8
mov rdi, rax
call operator delete(void*, unsigned long)

Why does using the ternary operator to return a string generate considerably different code from returning in an equivalent if/else block?

I was playing with the Compiler Explorer and I stumbled upon an interesting behavior with the ternary operator when using something like this:
std::string get_string(bool b)
{
return b ? "Hello" : "Stack-overflow";
}
The compiler generated code for this (clang trunk, with -O3) is this:
get_string[abi:cxx11](bool): # #get_string[abi:cxx11](bool)
push r15
push r14
push rbx
mov rbx, rdi
mov ecx, offset .L.str
mov eax, offset .L.str.1
test esi, esi
cmovne rax, rcx
add rdi, 16 #< Why is the compiler storing the length of the string
mov qword ptr [rbx], rdi
xor sil, 1
movzx ecx, sil
lea r15, [rcx + 8*rcx]
lea r14, [rcx + 8*rcx]
add r14, 5 #< I also think this is the length of "Hello" (but not sure)
mov rsi, rax
mov rdx, r14
call memcpy #< Why is there a call to memcpy
mov qword ptr [rbx + 8], r14
mov byte ptr [rbx + r15 + 21], 0
mov rax, rbx
pop rbx
pop r14
pop r15
ret
.L.str:
.asciz "Hello"
.L.str.1:
.asciz "Stack-Overflow"
However, the compiler generated code for the following snippet is considerably smaller and with no calls to memcpy, and does not care about knowing the length of both strings at the same time. There are 2 different labels that it jumps to
std::string better_string(bool b)
{
if (b)
{
return "Hello";
}
else
{
return "Stack-Overflow";
}
}
The compiler generated code for the above snippet (clang trunk with -O3) is this:
better_string[abi:cxx11](bool): # #better_string[abi:cxx11](bool)
mov rax, rdi
lea rcx, [rdi + 16]
mov qword ptr [rdi], rcx
test sil, sil
je .LBB0_2
mov dword ptr [rcx], 1819043144
mov word ptr [rcx + 4], 111
mov ecx, 5
mov qword ptr [rax + 8], rcx
ret
.LBB0_2:
movabs rdx, 8606216600190023247
mov qword ptr [rcx + 6], rdx
movabs rdx, 8525082558887720019
mov qword ptr [rcx], rdx
mov byte ptr [rax + 30], 0
mov ecx, 14
mov qword ptr [rax + 8], rcx
ret
The same result is when I use the ternary operator with:
std::string get_string(bool b)
{
return b ? std::string("Hello") : std::string("Stack-Overflow");
}
I would like to know why the ternary operator in the first example generates that compiler code. I believe that the culprit lies within the const char[].
P.S: GCC does calls to strlen in the first example but Clang doesn't.
Link to the Compiler Explorer example: https://godbolt.org/z/Exqs6G
Thank you for your time!
sorry for the wall of code

The overarching difference here is that the first version is branchless.
16 isn’t the length of any string here (the longer one, with NUL, is only 15 bytes long); it’s an offset into the return object (whose address is passed in RDI to support RVO), used to indicate that the small-string optimization is in use (note the lack of allocation). The lengths are 5 or 5+1+8 stored in R14, which is stored in the std::string as well as passed to memcpy (along with a pointer chosen by CMOVNE) to load the actual string bytes.
The other version has an obvious branch (although part of the std::string construction has been hoisted above it) and actually does have 5 and 14 explicitly, but is obfuscated by the fact that the string bytes have been included as immediate values (expressed as integers) of various sizes.
As for why these three equivalent functions produce two different versions of the generated code, all I can offer is that optimizers are iterative and heuristic algorithms; they don’t reliably find the same “best” assembly independently of their starting point.

The first version returns a string object which is initialized with a not-constant expression yielding one of the string literals, so the constructor is run as for any other variable string object, thus the memcpy to do the initialization.
The other variants return either one string object initialized with a string literal or another string object initialized with another string literal, both of which can be optimized to a string object constructed from a constant expression where no memcpy is needed.
So the real answer is: the first version operates the ?: operator on char[] expressions before initializing the objects and the other versions on the string objects already being initialized.
It does not matter whether one of the versions is branchless.

C++ std::string initialization better performance (assembly)

I was playing with www.godbolt.org to check what code generates better assembly code, and I can't understand why this two different approaches generate different results (in assembly commands).
The first approach is to declare a string, and then later set a value:
#include <string>
int foo() {
std::string a;
a = "abcdef";
return a.size();
}
Which, in my gcc 7.4 (-O3) outputs:
.LC0:
.string "abcdef"
foo():
push rbp
mov r8d, 6
mov ecx, OFFSET FLAT:.LC0
xor edx, edx
push rbx
xor esi, esi
sub rsp, 40
lea rbx, [rsp+16]
mov rdi, rsp
mov BYTE PTR [rsp+16], 0
mov QWORD PTR [rsp], rbx
mov QWORD PTR [rsp+8], 0
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long)
mov rdi, QWORD PTR [rsp]
mov rbp, QWORD PTR [rsp+8]
cmp rdi, rbx
je .L1
call operator delete(void*)
.L1:
add rsp, 40
mov eax, ebp
pop rbx
pop rbp
ret
mov rbp, rax
jmp .L3
foo() [clone .cold]:
.L3:
mov rdi, QWORD PTR [rsp]
cmp rdi, rbx
je .L4
call operator delete(void*)
.L4:
mov rdi, rbp
call _Unwind_Resume
So, I imagined that if I initialize the string in the declaration, the output assembly would be shorter:
int bar() {
std::string a {"abcdef"};
return a.size();
}
And indeed it is:
bar():
mov eax, 6
ret
Why this huge difference? What prevents gcc to optimize the first version similar to the second?
godbolt link

This is just a guess:
operator= has a strong exception guarantee; which means:
If an exception is thrown for any reason, this function has no effect (strong exception guarantee).
(since C++11)
(source)
So while the constructor can leave the object in any condition it likes, operator= needs to make sure that the object is the same as before; I suspect that's why the call to operator delete is there (to clean up potentially allocated memory).

Should I create object on a free store only for one call?

Assuming that code is located inside if block, what are differences between creating object in a free store and doing only one call on it:
auto a = aFactory.createA();
int result = a->foo(5);
and making call directly on returned pointer?
int result = aFactory.createA()->foo(5);
Is there any difference in performance? Which way is better?
#include <iostream>
#include <memory>
class A
{
public:
int foo(int a){return a+3;}
};
class AFactory
{
public:
std::unique_ptr<A> createA(){return std::make_unique<A>();}
};
int main()
{
AFactory aFactory;
bool condition = true;
if(condition)
{
auto a = aFactory.createA();
int result = a->foo(5);
}
}

Look here. There is no difference in code generated for both versions even with optimisations disabled.
Using gcc7.1 with -std=c++1z -O0
auto a = aFactory.createA();
int result = a->foo(5);
is compiled to:
lea rax, [rbp-24]
lea rdx, [rbp-9]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-24]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
and int result = aFactory.createA()->foo(5); to:
lea rax, [rbp-16]
lea rdx, [rbp-17]
mov rsi, rdx
mov rdi, rax
call AFactory::createA()
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::operator->() const
mov esi, 5
mov rdi, rax
call A::foo(int)
mov DWORD PTR [rbp-8], eax
lea rax, [rbp-16]
mov rdi, rax
call std::unique_ptr<A, std::default_delete<A> >::~unique_ptr()
So they are pretty much identical.
This outcome is understandable when you realise the only difference between the two versions is that in the first one we assign the name to our object, while in the second we work with an unnamed one. Other then that, they are both created on heap and used the same way. And since variable name means nothing to the compiler - it is only relevant for code-readers - it treats the two as if they were identical.

In your simple case it will not make a difference, because the (main) function ends right after creating and using a.
If some more lines of code would follow, the destruction of the a object would happen at the end of the if block in main, while in the one line case it becomes destructed at the end of that single line. However, it would be bad design if the destructor of a more sophisticated class A would make a difference on that.
Due to compiler optimizations performance questions should always be answered by testing with a profiler on the concrete code.

Optimization of raw new[]/delete[] vs std::vector

Let's mess around with very basic dynamically allocated memory. We take a vector of 3, set its elements and return the sum of the vector.
In the first test case I used a raw pointer with new[]/delete[]. In the second I used std::vector:
#include <vector>
int main()
{
//int *v = new int[3]; // (1)
auto v = std::vector<int>(3); // (2)
for (int i = 0; i < 3; ++i)
v[i] = i + 1;
int s = 0;
for (int i = 0; i < 3; ++i)
s += v[i];
//delete[] v; // (1)
return s;
}
Assembly of (1) (new[]/delete[])
main: # #main
mov eax, 6
ret
Assembly of (2) (std::vector)
main: # #main
push rax
mov edi, 12
call operator new(unsigned long)
mov qword ptr [rax], 0
movabs rcx, 8589934593
mov qword ptr [rax], rcx
mov dword ptr [rax + 8], 3
test rax, rax
je .LBB0_2
mov rdi, rax
call operator delete(void*)
.LBB0_2: # %std::vector<int, std::allocator<int> >::~vector() [clone .exit]
mov eax, 6
pop rdx
ret
Both outputs taken from https://gcc.godbolt.org/ with -std=c++14 -O3
In both versions the returned value is computed at compile time so we see just mov eax, 6; ret.
With the raw new[]/delete[] the dynamic allocation was completely removed. With std::vector however, the memory is allocated, set and freed.
This happens even with an unused variable auto v = std::vector<int>(3): call to new, memory is set and then call to delete.
I realize this is most likely a near impossible answer to give, but maybe someone has some insights and some interesting answers might pop out.
What are the contributing factors that don't allow compiler optimizations to remove the memory allocation in the std::vector case, like in the raw memory allocation case?

When using a pointer to a dynamically allocated array (directly using new[] and delete[]), the compiler optimized away the calls to operator new and operator delete even though they have observable side effects. This optimization is allowed by the C++ standard section 5.3.4 paragraph 10:
An implementation is allowed to omit a call to a replaceable global
allocation function (18.6.1.1, 18.6.1.2). When it does so, the storage
is instead provided by the implementation or...
I'll show the rest of the sentence, which is crucial, at the end.
This optimization is relatively new because it was first allowed in C++14 (proposal N3664). Clang supported it since 3.4. The latest version of gcc, namely 5.3.0, doesn't take advantage of this relaxation of the as-if rule. It produces the following code:
main:
sub rsp, 8
mov edi, 12
call operator new[](unsigned long)
mov DWORD PTR [rax], 1
mov DWORD PTR [rax+4], 2
mov rdi, rax
mov DWORD PTR [rax+8], 3
call operator delete[](void*)
mov eax, 6
add rsp, 8
ret
MSVC 2013 also doesn't support this optimization. It produces the following code:
main:
sub rsp,28h
mov ecx,0Ch
call operator new[] ()
mov rcx,rax
mov dword ptr [rax],1
mov dword ptr [rax+4],2
mov dword ptr [rax+8],3
call operator delete[] ()
mov eax,6
add rsp,28h
ret
I currently don't have access to MSVC 2015 Update 1 and therefore I don't know whether it supports this optimization or not.
Finally, here is the assembly code generated by icc 13.0.1:
main:
push rbp
mov rbp, rsp
and rsp, -128
sub rsp, 128
mov edi, 3
call __intel_new_proc_init
stmxcsr DWORD PTR [rsp]
mov edi, 12
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
call operator new[](unsigned long)
mov rdi, rax
mov DWORD PTR [rax], 1
mov DWORD PTR [4+rax], 2
mov DWORD PTR [8+rax], 3
call operator delete[](void*)
mov eax, 6
mov rsp, rbp
pop rbp
ret
Clearly, it doesn't support this optimization. I don't have access to the latest version of icc, namely 16.0.
All of these code snippets have been produced with optimizations enabled.
When using std::vector, all of these compilers didn't optimize away the allocation. When a compiler doesn't perform an optimization, it's either because it cannot for some reason or it's just not yet supported.
What are the contributing factors that don't allow compiler
optimizations to remove the memory allocation in the std::vector case,
like in the raw memory allocation case?
The compiler didn't perform the optimization because it's not allowed to. To see this, let's see the rest of the sentence of paragraph 10 from 5.3.4:
An implementation is allowed to omit a call to a replaceable global
allocation function (18.6.1.1, 18.6.1.2). When it does so, the storage
is instead provided by the implementation or provided by extending the
allocation of another new-expression.
What this is saying is that you can omit a call to a replaceable global allocation function only if it originated from a new-expression. A new-expression is defined in paragraph 1 of the same section.
The following expression
new int[3]
is a new-expression and therefore the compiler is allowed to optimize away the associated allocation function call.
On the other hand, the following expression:
::operator new(12)
is NOT a new-expression (see 5.3.4 paragraph 1). This is just a function call expression. In other words, this is treated as a typical function call. This function cannot be optimized away because its imported from another shared library (even if you linked the runtime statically, the function itself calls another imported function).
The default allocator used by std::vector allocates memory using ::operator new and therefore the compiler is not allowed to optimize it away.
Let's test this. Here's the code:
int main()
{
int *v = (int*)::operator new(12);
for (int i = 0; i < 3; ++i)
v[i] = i + 1;
int s = 0;
for (int i = 0; i < 3; ++i)
s += v[i];
delete v;
return s;
}
By compiling using Clang 3.7, we get the following assembly code:
main: # #main
push rax
mov edi, 12
call operator new(unsigned long)
movabs rcx, 8589934593
mov qword ptr [rax], rcx
mov dword ptr [rax + 8], 3
test rax, rax
je .LBB0_2
mov rdi, rax
call operator delete(void*)
.LBB0_2:
mov eax, 6
pop rdx
ret
This is exactly the same as assembly code generated when using std::vector except for mov qword ptr [rax], 0 which comes from the constructor of std::vector (the compiler should have removed it but failed to do so because of a flaw in its optimization algorithms).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Why does unique_ptr instantiation compile to larger binary than raw pointer? - c++

Related

Will delete the pointer to pointer cause the memory leak?

Why does using the ternary operator to return a string generate considerably different code from returning in an equivalent if/else block?

C++ std::string initialization better performance (assembly)

Should I create object on a free store only for one call?

Optimization of raw new[]/delete[] vs std::vector

Categories

Resources