I am curious why the following piece of code:
#include <string>
int main()
{
    std::string a = "ABCDEFGHIJKLMNO";
}
when compiled with -O3 yields the following code:
main: # #main
xor eax, eax
ret
(I perfectly understand that there is no need for the unused a, so the compiler can omit it from the generated code entirely.)
However the following program:
#include <string>
int main()
{
    std::string a = "ABCDEFGHIJKLMNOP"; // <-- !!! One Extra P
}
yields:
main: # #main
push rbx
sub rsp, 48
lea rbx, [rsp + 32]
mov qword ptr [rsp + 16], rbx
mov qword ptr [rsp + 8], 16
lea rdi, [rsp + 16]
lea rsi, [rsp + 8]
xor edx, edx
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, unsigned long)
mov qword ptr [rsp + 16], rax
mov rcx, qword ptr [rsp + 8]
mov qword ptr [rsp + 32], rcx
movups xmm0, xmmword ptr [rip + .L.str]
movups xmmword ptr [rax], xmm0
mov qword ptr [rsp + 24], rcx
mov rax, qword ptr [rsp + 16]
mov byte ptr [rax + rcx], 0
mov rdi, qword ptr [rsp + 16]
cmp rdi, rbx
je .LBB0_3
call operator delete(void*)
.LBB0_3:
xor eax, eax
add rsp, 48
pop rbx
ret
mov rdi, rax
call _Unwind_Resume
.L.str:
.asciz "ABCDEFGHIJKLMNOP"
when compiled with the same -O3. I don't understand why it does not recognize that a is still unused, even though the string is one byte longer.
This question is relevant to gcc 9.1 and clang 8.0 (online: https://gcc.godbolt.org/z/p1Z8Ns), because other compilers I have observed either drop the unused variable entirely (ellcc) or generate code for it regardless of the length of the string.
This is due to the small string optimization. When the string data is less than or equal to 16 characters, including the null terminator, it is stored in a buffer local to the std::string object itself. Otherwise, the string allocates memory on the heap and stores the data there.
The first string "ABCDEFGHIJKLMNO" plus the null terminator is exactly 16 bytes. Adding "P" makes it exceed the local buffer, hence new is called internally, which may ultimately result in a system call. The compiler can only optimize something away if it can prove that there are no side effects. A possible system call makes that proof (nearly) impossible - by contrast, changing a buffer local to the object under construction allows for such a side-effect analysis.
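To see this boundary from outside the library, here is a minimal diagnostic sketch (uses_internal_buffer is a helper name I made up; comparing unrelated pointers like this is formally unspecified, but it works as an illustration on common implementations):

#include <cstdio>
#include <string>

// Made-up helper: reports whether the string's buffer lies inside the
// std::string object itself (SSO) or outside it (heap).
static bool uses_internal_buffer(const std::string& s)
{
    const char* obj_begin = reinterpret_cast<const char*>(&s);
    const char* obj_end = obj_begin + sizeof(std::string);
    return s.data() >= obj_begin && s.data() < obj_end;
}

int main()
{
    std::string small = "ABCDEFGHIJKLMNO";  // 15 chars + NUL = 16 bytes
    std::string big = "ABCDEFGHIJKLMNOP";   // 16 chars + NUL = 17 bytes
    std::printf("small: %s\n", uses_internal_buffer(small) ? "in-object" : "heap");
    std::printf("big: %s\n", uses_internal_buffer(big) ? "in-object" : "heap");
}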
Tracing the local buffer in libstdc++, version 9.1, reveals these parts of bits/basic_string.h:
template<typename _CharT, typename _Traits, typename _Alloc>
class basic_string
{
    // ...
    enum { _S_local_capacity = 15 / sizeof(_CharT) };

    union
    {
        _CharT _M_local_buf[_S_local_capacity + 1];
        size_type _M_allocated_capacity;
    };
    // ...
};
which lets you spot the local buffer size _S_local_capacity and the local buffer itself (_M_local_buf). When the constructor calls basic_string::_M_construct, you find in bits/basic_string.tcc:
void _M_construct(_InIterator __beg, _InIterator __end, ...)
{
    size_type __len = 0;
    size_type __capacity = size_type(_S_local_capacity);
    while (__beg != __end && __len < __capacity)
    {
        _M_data()[__len++] = *__beg;
        ++__beg;
    }
where the local buffer is filled with the given content. Right after this part, we get to the branch where the local capacity is exhausted - new storage is allocated (via _M_create), the local buffer is copied into the new storage, and the rest of the initializing argument is appended:
while (__beg != __end)
{
    if (__len == __capacity)
    {
        // Allocate more space.
        __capacity = __len + 1;
        pointer __another = _M_create(__capacity, __len);
        this->_S_copy(__another, _M_data(), __len);
        _M_dispose();
        _M_data(__another);
        _M_capacity(__capacity);
    }
    _M_data()[__len++] = *__beg;
    ++__beg;
}
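You can also observe the switchover from user code. A small sketch; the concrete numbers assume libstdc++ 9.x and will differ on other standard libraries:

#include <cstdio>
#include <string>

int main()
{
    std::string s = "ABCDEFGHIJKLMNO";            // 15 chars: fits _M_local_buf
    std::printf("capacity: %zu\n", s.capacity()); // 15 with libstdc++
    s += 'P';                                     // 16 chars: _M_create allocates
    std::printf("capacity: %zu\n", s.capacity()); // larger now; data is on the heap
}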
As a side note, small string optimization is quite a topic of its own. To get a feeling for how tweaking individual bits can make a difference at large scale, I'd recommend this talk. It also describes how the std::string implementation that ships with gcc (libstdc++) works and how it has changed over time to match newer versions of the standard.
I was surprised the compiler saw through a std::string constructor/destructor pair until I saw your second example. It didn't. What you're seeing here is small string optimization and corresponding optimizations from the compiler around that.
Small string optimization is when the std::string object itself is big enough to hold the contents of the string, a size, and possibly a discriminating bit used to indicate whether the string is operating in small- or big-string mode. In that case, no dynamic allocation occurs and the string is stored in the std::string object itself.
Compilers are really bad at eliding unneeded allocations and deallocations: they are treated almost as if they had side effects, and are thus impossible to elide. Once you go over the small string optimization threshold, a dynamic allocation occurs, and the result is what you see.
As an example
void foo() {
    delete new int;
}
is the simplest, dumbest allocation/deallocation pair possible, yet gcc emits this assembly even at -O3:
sub rsp, 8
mov edi, 4
call operator new(unsigned long)
mov esi, 4
add rsp, 8
mov rdi, rax
jmp operator delete(void*, unsigned long)
While the accepted answer is valid, since C++14 it's actually the case that new and delete calls can be optimized away. See this arcane wording on cppreference:
New-expressions are allowed to elide ... allocations made through replaceable allocation functions. In case of elision, the storage may be provided by the compiler without making the call to an allocation function (this also permits optimizing out unused new-expression).
...
Note that this optimization is only permitted when new-expressions are used, not any other methods to call a replaceable allocation function: delete[] new int[10]; can be optimized out, but operator delete(operator new(10)); cannot.
This actually allows compilers to completely drop your local std::string even if it's very long. In fact - clang++ with libc++ already does this (GodBolt), since libc++ uses built-ins __new and __delete in its implementation of std::string - that's "storage provided by the compiler". Thus, we get:
main():
xor eax, eax
ret
with an unused string of basically any length.
GCC doesn't do this yet, but I've recently opened bug reports about it; see this SO answer for links.
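To make the quoted restriction concrete, here is a small sketch of the two forms side by side. With a C++14-conforming optimizer (e.g. recent clang at -O3), the first function typically compiles to a bare ret, while the second keeps both calls:

// May be elided: the allocation comes from a new-expression.
void foo() {
    delete[] new int[10];
}

// May NOT be elided: these are plain calls to the replaceable
// allocation/deallocation functions, not new-expressions.
void bar() {
    operator delete(operator new(10));
}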
Related
I was playing with the Compiler Explorer and I stumbled upon an interesting behavior with the ternary operator when using something like this:
std::string get_string(bool b)
{
    return b ? "Hello" : "Stack-Overflow";
}
The code the compiler generates for this (clang trunk, with -O3) is:
get_string[abi:cxx11](bool): # #get_string[abi:cxx11](bool)
push r15
push r14
push rbx
mov rbx, rdi
mov ecx, offset .L.str
mov eax, offset .L.str.1
test esi, esi
cmovne rax, rcx
add rdi, 16 #< Why is the compiler storing the length of the string
mov qword ptr [rbx], rdi
xor sil, 1
movzx ecx, sil
lea r15, [rcx + 8*rcx]
lea r14, [rcx + 8*rcx]
add r14, 5 #< I also think this is the length of "Hello" (but not sure)
mov rsi, rax
mov rdx, r14
call memcpy #< Why is there a call to memcpy
mov qword ptr [rbx + 8], r14
mov byte ptr [rbx + r15 + 21], 0
mov rax, rbx
pop rbx
pop r14
pop r15
ret
.L.str:
.asciz "Hello"
.L.str.1:
.asciz "Stack-Overflow"
However, the compiler-generated code for the following snippet is considerably smaller, has no call to memcpy, and does not need to know the lengths of both strings at the same time. Instead, there are 2 different labels that the generated code jumps to.
std::string better_string(bool b)
{
    if (b)
    {
        return "Hello";
    }
    else
    {
        return "Stack-Overflow";
    }
}
The compiler-generated code for the above snippet (clang trunk with -O3) is:
better_string[abi:cxx11](bool): # #better_string[abi:cxx11](bool)
mov rax, rdi
lea rcx, [rdi + 16]
mov qword ptr [rdi], rcx
test sil, sil
je .LBB0_2
mov dword ptr [rcx], 1819043144
mov word ptr [rcx + 4], 111
mov ecx, 5
mov qword ptr [rax + 8], rcx
ret
.LBB0_2:
movabs rdx, 8606216600190023247
mov qword ptr [rcx + 6], rdx
movabs rdx, 8525082558887720019
mov qword ptr [rcx], rdx
mov byte ptr [rax + 30], 0
mov ecx, 14
mov qword ptr [rax + 8], rcx
ret
The result is the same when I use the ternary operator with:
std::string get_string(bool b)
{
    return b ? std::string("Hello") : std::string("Stack-Overflow");
}
I would like to know why the ternary operator in the first example generates that compiler code. I believe that the culprit lies within the const char[].
P.S.: GCC emits calls to strlen in the first example, but Clang doesn't.
Link to the Compiler Explorer example: https://godbolt.org/z/Exqs6G
Thank you for your time!
Sorry for the wall of code.
The overarching difference here is that the first version is branchless.
16 isn’t the length of any string here (the longer one, with NUL, is only 15 bytes long); it’s an offset into the return object (whose address is passed in RDI to support RVO), used to indicate that the small-string optimization is in use (note the lack of allocation). The lengths are 5 or 5+1+8 stored in R14, which is stored in the std::string as well as passed to memcpy (along with a pointer chosen by CMOVNE) to load the actual string bytes.
The other version has an obvious branch (although part of the std::string construction has been hoisted above it) and actually does have 5 and 14 explicitly, but is obfuscated by the fact that the string bytes have been included as immediate values (expressed as integers) of various sizes.
As for why these three equivalent functions produce two different versions of the generated code, all I can offer is that optimizers are iterative and heuristic algorithms; they don’t reliably find the same “best” assembly independently of their starting point.
The first version returns a string object which is initialized with a non-constant expression yielding one of the string literals, so the constructor runs as it would for any other variable string object; hence the memcpy to do the initialization.
The other variants return either one string object initialized with a string literal or another string object initialized with another string literal, both of which can be optimized to a string object constructed from a constant expression where no memcpy is needed.
So the real answer is: the first version applies the ?: operator to char[] expressions before initializing the object, while the other versions apply it to string objects already being initialized.
It does not matter whether one of the versions is branchless.
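One way to check the claim about ?: acting on the char arrays (a small sketch; C++17 for is_same_v): both literals decay, so the conditional expression itself already has type const char*, and the std::string constructor then runs with a runtime pointer of unknown length:

#include <type_traits>

// Both branches decay from const char[N] to const char*, so the result
// of the conditional expression is a const char* prvalue.
static_assert(std::is_same_v<
    decltype(true ? "Hello" : "Stack-Overflow"),
    const char*>);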
Consider the following code:
#include <utility>
#include <string>

int bar() {
    std::pair<int, std::string> p {
        123, "Hey... no small-string optimization for me please!" };
    return p.first;
}
(simplified thanks to #Jarod42 :-) ...)
I expect the function to be implemented as simply:
bar():
mov eax, 123
ret
but instead, the implementation calls operator new(), constructs an std::string with my literal, then calls operator delete(). At least - that's what gcc 9 and clang 9 do (GodBolt). Here's the clang output:
bar(): # #bar()
push rbx
sub rsp, 48
mov dword ptr [rsp + 8], 123
lea rax, [rsp + 32]
mov qword ptr [rsp + 16], rax
mov edi, 51
call operator new(unsigned long)
mov qword ptr [rsp + 16], rax
mov qword ptr [rsp + 32], 50
movups xmm0, xmmword ptr [rip + .L.str]
movups xmmword ptr [rax], xmm0
movups xmm0, xmmword ptr [rip + .L.str+16]
movups xmmword ptr [rax + 16], xmm0
movups xmm0, xmmword ptr [rip + .L.str+32]
movups xmmword ptr [rax + 32], xmm0
mov word ptr [rax + 48], 8549
mov qword ptr [rsp + 24], 50
mov byte ptr [rax + 50], 0
mov ebx, dword ptr [rsp + 8]
mov rdi, rax
call operator delete(void*)
mov eax, ebx
add rsp, 48
pop rbx
ret
.L.str:
.asciz "Hey... no small-string optimization for me please!"
My question is: Clearly, the compiler has full knowledge of everything going on inside bar(). Why is it not "eliding"/optimizing the string away? More specifically:
At the basic level there's the code between the new() and delete() calls, which AFAICT the compiler knows results in nothing useful.
Secondarily, the new() and delete() calls themselves. After all, small-string optimization is allowed by the standard AFAIK, so even though clang/gcc haven't chosen to use it here - they could have; meaning that they're not actually required to call new() or delete() there.
I'm particularly interested in what part of this is directly due to the language standard, and what part is compiler non-optimality.
Nothing in your code represents "elision" as that term is commonly used in a C++ context. The compiler is not permitted to remove anything from that code on the grounds of "elision".
The only grounds a compiler has to remove the creation of that string is on the basis of the "as if" rule. That is, is the behavior of the string creation/destruction visible to the user and therefore not able to be removed?
Since it uses std::allocator and the standard character traits, the basic_string construction and destruction itself is not being overridden by the user. So there is some basis for the idea that the string's creation is not a visible side-effect of the function call and thus could be removed under the "as if" rule.
However, because std::allocator::allocate is specified to call ::operator new, and operator new is globally replaceable, it is reasonable to argue that this is a visible side effect of the construction of such a string. And therefore, the compiler cannot remove it under the "as if" rule.
If the compiler knows that you have not replaced operator new, then it can in theory optimize the string away.
That doesn't mean that any particular compiler will do so.
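A minimal sketch of why replaceability matters (this is the standard replacement mechanism; the printf makes the side effect user-visible, so removing the string's allocation would change observable behavior in practice):

#include <cstdio>
#include <cstdlib>
#include <new>

// Replaced global operator new: every allocation in the program,
// including one hidden inside a std::string, now has a visible effect.
void* operator new(std::size_t n)
{
    std::printf("allocating %zu bytes\n", n);
    if (void* p = std::malloc(n))
        return p;
    throw std::bad_alloc{};
}

// Replace the matching deallocation function as well.
void operator delete(void* p) noexcept
{
    std::printf("deallocating\n");
    std::free(p);
}

int main()
{
    int* p = new int{42}; // typically prints "allocating 4 bytes"
    delete p;
}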
Following the discussion in various answers and comments here, I have now filed the following bugs against GCC and LLVM regarding this issue:
GCC bug 94293: [missed optimization] new+delete of unused local string not removed
Minimal testcase (GodBolt):
void foo() {
    int *p = new int[1];
    *p = 42;
    delete[] p;
}
GCC bug 94294: [missed optimization] Useless statements populating local string not removed
Minimal testcase (GodBolt):
#include <string>
void foo() {
    std::string s { "This is not a small string" };
}
LLVM bug 45287: [missed optimization] failure to drop unused libstdc++ std::string.
Minimal testcase (GodBolt):
#include <string>
void foo() {
    std::string s { "This is not a small string" };
}
Thanks go to #JeffGarret, #NicolBolas, #Jarod42, and Marc Glisse.
Update, August 2021: With recent versions of clang++, g++, and libstdc++, all of these minimal testcases eschew memory allocation, as one would expect. clang++ also has this behavior for the OP's program in the question, but GCC still allocates and deallocates.
The question is: can the program
int bar() {
    std::pair<int, std::string> p {
        123, "Hey... no small-string optimization for me please!" };
    return p.first;
}
be validly optimized to
int bar() {
    return 123;
}
tl;dr: yes, I think so.
And clang does with libc++: godbolt
About std::string, the standard says in [string.require]/3:
Every object of type basic_string uses an object of type Allocator to allocate and free storage for the contained charT objects as needed.
"as needed". std::string is allowed to decide when to use the allocator (which is I believe the justification for SSO being valid). Its member functions do not mandate an allocation. Therefore the allocation may be elided.
Good evening.
I know that C-style arrays and std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have a situation in which the use of std::array performs better than std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code, which initializes the variables, is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
    return m + f;
}
And the benchmark functions:
void assemble_vec()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(v1[0],v2[0]);
        glob[i+1] += comb(v1[1],v2[1]);
        glob[i+2] += comb(v1[2],v2[2]);
    }
}

void assemble_arr()
{
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += comb(a1[0],a2[0]);
        glob[i+1] += comb(a1[1],a2[1]);
        glob[i+2] += comb(a1[2],a2[2]);
    }
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator means dynamic memory), the values have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and referenced directly in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively: in the vector version, the operands have to be reloaded from memory on every iteration, whereas in the array version the sums are compile-time constants kept in registers. The vector version therefore involves far more memory accesses than the array version. These factors in combination are going to result in code that executes faster for the std::array version than for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed not to be modified after it's constructed, but the memory is still dynamically allocated, and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In #Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
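A tiny sketch of that strict-aliasing reasoning (a hypothetical function, not from the question): a store through a double* cannot alias a double**, so the compiler may reuse the earlier load:

double* reload_after_store(double** pp, double* d)
{
    double* p = *pp;
    *d = 1.0;   // under type-based aliasing rules, cannot modify *pp
    return *pp; // may be folded to p without reloading
}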
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but maybe it used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
    double *__restrict g = &glob[0]; // Helps MSVC, but not gcc/clang/ICC
    // std::vector<double> &g = glob; // actually hurts ICC it seems?
    // #define g glob // so use this as the alternative to __restrict
    for (size_t i=0; i<N-2; ++i)
    {
        g[i] += comb(v1[0],v2[0]);
        g[i+1] += comb(v1[1],v2[1]);
        g[i+2] += comb(v1[2],v2[2]);
    }
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
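If changing the source is an option, hoisting the loads by hand sidesteps the aliasing problem entirely. A sketch reusing the question's declarations (glob, v1, v2, comb, N):

// Compute the three loop-invariant sums once; the compiler no longer
// needs to prove that stores into glob don't alias v1/v2's heap storage.
void assemble_vec_hoisted()
{
    const double c0 = comb(v1[0], v2[0]);
    const double c1 = comb(v1[1], v2[1]);
    const double c2 = comb(v1[2], v2[2]);
    for (size_t i=0; i<N-2; ++i)
    {
        glob[i] += c0;
        glob[i+1] += c1;
        glob[i+2] += c2;
    }
}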
I think the point is that you use a very small storage size (six doubles), which allows the compiler, in the std::array case, to eliminate the stores to RAM entirely by keeping the values in registers. The compiler can keep stack variables in registers when that is more optimal. This cuts the memory accesses in half (only the writes to glob remain). In the case of std::vector, the compiler cannot perform such an optimization, since dynamic memory is used. Try significantly larger sizes for a1, a2, v1, v2.
Let's assume I pass a new object to a function like this:
loadContainer->addControlView( new BmpView( BMP_PICTURE ) );
Now, I want to change a specific characteristic of the BmpView before I pass it to addControlView. The way I do this is like this:
Control* newView = new BmpView( BMP_PICTURE );
newView->changeColor( WHITE );
loadContainer->addControlView( newView );
Does this create an extra temporary/local object? Or is there an equal amount of memory allocated in both cases?
The only extra memory allocated in your function is for the pointer newView itself, whose size is small and does not depend on the actual size of BmpView. It doesn't allocate memory for BmpView twice.
I'm not considering any memory overhead of calling changeColor, which I assume wasn't the point of this question.
In both cases, there is a single call to new, hence the amount of memory used is equal. (Note, this is without speculating about any allocation possibly requested by the BmpView constructor, changeColor(), etc.)
However, you may wish to refactor your code to ensure some exception safety and avoid potential leaks, and hence to keep the amount of memory used under control:
// C++11
std::unique_ptr<Control> newView(new BmpView(BMP_PICTURE));
// C++14, preferred
//auto newView = std::make_unique<BmpView>(BMP_PICTURE);
newView->changeColor( WHITE );
loadContainer->addControlView( newView.release() );
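If you also control addControlView, a further refactor makes the ownership transfer explicit and removes the release() call. A sketch with hypothetical stand-ins for the question's types:

#include <memory>
#include <utility>
#include <vector>

// Stand-ins for the question's types (hypothetical).
struct Control { virtual ~Control() = default; };
struct BmpView : Control {
    explicit BmpView(int bitmapId) {}
    void changeColor(int) {}
};

struct LoadContainer {
    // Taking unique_ptr by value documents the ownership transfer.
    void addControlView(std::unique_ptr<Control> view) {
        views.push_back(std::move(view));
    }
    std::vector<std::unique_ptr<Control>> views;
};

int main()
{
    LoadContainer loadContainer;
    auto newView = std::make_unique<BmpView>(/*BMP_PICTURE*/ 1);
    newView->changeColor(/*WHITE*/ 0);
    loadContainer.addControlView(std::move(newView));
}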
Reference for the code/assembly below:
https://godbolt.org/g/8DgmC1
#include <cstdio>

class ValueClass {
public:
    // Class content not important...
    int someValue;
};

void PrintValueClass(ValueClass* ptr) {
    printf("%d\n", ptr->someValue);
}

int main() {
    PrintValueClass(new ValueClass());

    ValueClass* pValueClass = new ValueClass();
    pValueClass->someValue = 55;
    PrintValueClass(pValueClass);

    return 1;
}
Compiled assembly (PrintValueClass redacted, as it is not important to the question at hand):
Example where you pass the (new ValueClass) directly to the function.
main:
push rbp
mov rbp, rsp
mov edi, 4
call operator new(unsigned long)
mov DWORD PTR [rax], 0
mov rdi, rax
call PrintValueClass(ValueClass*)
mov eax, 1
pop rbp
ret
Example where you create a local variable holding the pointer, do something to it, and then pass it to the function.
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 4
call operator new(unsigned long)
mov DWORD PTR [rax], 0
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 55
mov rax, QWORD PTR [rbp-8]
mov rdi, rax
call PrintValueClass(ValueClass*)
mov eax, 1
leave
ret
Before diving into the assembly: if your question is whether the new operation occurs twice when you store the pointer in a variable first, the answer is no. As the assembly shows, the new 'function' is only called once, so only sizeof(ValueClass) bytes are ever allocated here through a heap allocation function. But I feel the need to answer the question fully, even if that was not exactly what was asked: is extra memory used? Technically yes, realistically no.
The only difference between these two pieces of code is stack 'allocation', noted by the sub rsp, 16, which essentially means 'allocate' 16 bytes on the stack for local variables. So truly the only difference here is 16 bytes, an amount that will vary greatly depending on what compiler you use, what architecture you target, and many other factors.
At the end of the day, I would go so far as to say you will never care about the extra 16 bytes.
Let's mess around with very basic dynamically allocated memory. We take a vector of 3 ints, set its elements, and return their sum.
In the first test case I used a raw pointer with new[]/delete[]. In the second I used std::vector:
#include <vector>
int main()
{
    //int *v = new int[3]; // (1)
    auto v = std::vector<int>(3); // (2)

    for (int i = 0; i < 3; ++i)
        v[i] = i + 1;

    int s = 0;
    for (int i = 0; i < 3; ++i)
        s += v[i];

    //delete[] v; // (1)
    return s;
}
Assembly of (1) (new[]/delete[])
main: # #main
mov eax, 6
ret
Assembly of (2) (std::vector)
main: # #main
push rax
mov edi, 12
call operator new(unsigned long)
mov qword ptr [rax], 0
movabs rcx, 8589934593
mov qword ptr [rax], rcx
mov dword ptr [rax + 8], 3
test rax, rax
je .LBB0_2
mov rdi, rax
call operator delete(void*)
.LBB0_2: # %std::vector<int, std::allocator<int> >::~vector() [clone .exit]
mov eax, 6
pop rdx
ret
Both outputs taken from https://gcc.godbolt.org/ with -std=c++14 -O3
In both versions the returned value is computed at compile time, so in the end we see just mov eax, 6; ret.
With the raw new[]/delete[], the dynamic allocation was completely removed. With std::vector, however, the memory is allocated, set, and freed.
This happens even with an unused variable auto v = std::vector<int>(3): there is a call to new, the memory is set, and then a call to delete.
I realize this is most likely a near-impossible question to answer definitively, but maybe someone has some insight, and some interesting answers might pop out.
What are the contributing factors that don't allow compiler optimizations to remove the memory allocation in the std::vector case, like in the raw memory allocation case?
When using a pointer to a dynamically allocated array (directly using new[] and delete[]), the compiler optimized away the calls to operator new and operator delete even though they have observable side effects. This optimization is allowed by the C++ standard section 5.3.4 paragraph 10:
An implementation is allowed to omit a call to a replaceable global allocation function (18.6.1.1, 18.6.1.2). When it does so, the storage is instead provided by the implementation or...
I'll show the rest of the sentence, which is crucial, at the end.
This optimization is relatively new because it was first allowed in C++14 (proposal N3664). Clang has supported it since 3.4. The latest version of gcc at the time of writing, namely 5.3.0, doesn't take advantage of this relaxation of the as-if rule. It produces the following code:
main:
sub rsp, 8
mov edi, 12
call operator new[](unsigned long)
mov DWORD PTR [rax], 1
mov DWORD PTR [rax+4], 2
mov rdi, rax
mov DWORD PTR [rax+8], 3
call operator delete[](void*)
mov eax, 6
add rsp, 8
ret
MSVC 2013 also doesn't support this optimization. It produces the following code:
main:
sub rsp,28h
mov ecx,0Ch
call operator new[] ()
mov rcx,rax
mov dword ptr [rax],1
mov dword ptr [rax+4],2
mov dword ptr [rax+8],3
call operator delete[] ()
mov eax,6
add rsp,28h
ret
I currently don't have access to MSVC 2015 Update 1 and therefore I don't know whether it supports this optimization or not.
Finally, here is the assembly code generated by icc 13.0.1:
main:
push rbp
mov rbp, rsp
and rsp, -128
sub rsp, 128
mov edi, 3
call __intel_new_proc_init
stmxcsr DWORD PTR [rsp]
mov edi, 12
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
call operator new[](unsigned long)
mov rdi, rax
mov DWORD PTR [rax], 1
mov DWORD PTR [4+rax], 2
mov DWORD PTR [8+rax], 3
call operator delete[](void*)
mov eax, 6
mov rsp, rbp
pop rbp
ret
Clearly, it doesn't support this optimization. I don't have access to the latest version of icc, namely 16.0.
All of these code snippets have been produced with optimizations enabled.
When using std::vector, none of these compilers optimized away the allocation. When a compiler doesn't perform an optimization, it's either because it cannot for some reason or because the optimization is just not yet supported.
What are the contributing factors that don't allow compiler optimizations to remove the memory allocation in the std::vector case, like in the raw memory allocation case?
The compiler didn't perform the optimization because it's not allowed to. To see this, let's see the rest of the sentence of paragraph 10 from 5.3.4:
An implementation is allowed to omit a call to a replaceable global allocation function (18.6.1.1, 18.6.1.2). When it does so, the storage is instead provided by the implementation or provided by extending the allocation of another new-expression.
What this is saying is that you can omit a call to a replaceable global allocation function only if it originated from a new-expression. A new-expression is defined in paragraph 1 of the same section.
The following expression
new int[3]
is a new-expression and therefore the compiler is allowed to optimize away the associated allocation function call.
On the other hand, the following expression:
::operator new(12)
is NOT a new-expression (see 5.3.4 paragraph 1). This is just a function call expression. In other words, it is treated as a typical function call. This function cannot be optimized away because it's imported from another shared library (even if you linked the runtime statically, the function itself calls another imported function).
The default allocator used by std::vector allocates memory using ::operator new and therefore the compiler is not allowed to optimize it away.
Let's test this. Here's the code:
int main()
{
    int *v = (int*)::operator new(12);
    for (int i = 0; i < 3; ++i)
        v[i] = i + 1;

    int s = 0;
    for (int i = 0; i < 3; ++i)
        s += v[i];

    ::operator delete(v); // match the explicit ::operator new call
    return s;
}
By compiling using Clang 3.7, we get the following assembly code:
main: # #main
push rax
mov edi, 12
call operator new(unsigned long)
movabs rcx, 8589934593
mov qword ptr [rax], rcx
mov dword ptr [rax + 8], 3
test rax, rax
je .LBB0_2
mov rdi, rax
call operator delete(void*)
.LBB0_2:
mov eax, 6
pop rdx
ret
This is exactly the same as the assembly code generated when using std::vector, except for mov qword ptr [rax], 0, which comes from the constructor of std::vector (the compiler should have removed it but failed to do so because of a flaw in its optimization algorithms).
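For completeness, here is the question's variant (1) with the comments resolved. Both calls originate from a new-expression and a matching delete-expression, so Clang 3.4+ at -O3 folds the whole function to mov eax, 6; ret:

int main()
{
    int *v = new int[3]; // new-expression: elidable
    for (int i = 0; i < 3; ++i)
        v[i] = i + 1;

    int s = 0;
    for (int i = 0; i < 3; ++i)
        s += v[i];

    delete[] v; // paired delete-expression
    return s;
}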