Consider the following code:
#include <utility>
#include <string>
int bar() {
std::pair<int, std::string> p {
123, "Hey... no small-string optimization for me please!" };
return p.first;
}
(simplified thanks to @Jarod42 :-) ...)
I expect the function to be implemented as simply:
bar():
mov eax, 123
ret
but instead, the implementation calls operator new(), constructs an std::string with my literal, then calls operator delete(). At least - that's what gcc 9 and clang 9 do (GodBolt). Here's the clang output:
bar(): # @bar()
push rbx
sub rsp, 48
mov dword ptr [rsp + 8], 123
lea rax, [rsp + 32]
mov qword ptr [rsp + 16], rax
mov edi, 51
call operator new(unsigned long)
mov qword ptr [rsp + 16], rax
mov qword ptr [rsp + 32], 50
movups xmm0, xmmword ptr [rip + .L.str]
movups xmmword ptr [rax], xmm0
movups xmm0, xmmword ptr [rip + .L.str+16]
movups xmmword ptr [rax + 16], xmm0
movups xmm0, xmmword ptr [rip + .L.str+32]
movups xmmword ptr [rax + 32], xmm0
mov word ptr [rax + 48], 8549
mov qword ptr [rsp + 24], 50
mov byte ptr [rax + 50], 0
mov ebx, dword ptr [rsp + 8]
mov rdi, rax
call operator delete(void*)
mov eax, ebx
add rsp, 48
pop rbx
ret
.L.str:
.asciz "Hey... no small-string optimization for me please!"
My question is: Clearly, the compiler has full knowledge of everything going on inside bar(). Why is it not "eliding"/optimizing the string away? More specifically:
At the basic level there's the code between the new() and delete() calls, which AFAICT the compiler knows results in nothing useful.
Secondarily, the new() and delete() calls themselves. After all, small-string-optimization is allowed by the standard AFAIK, so even though clang/gcc hasn't chosen to use that - it could have; meaning that it's not actually required to call new() or delete() there.
I'm particularly interested in what part of this is directly due to the language standard, and what part is compiler non-optimality.
Nothing in your code represents "elision" as that term is commonly used in a C++ context. The compiler is not permitted to remove anything from that code on the grounds of "elision".
The only grounds a compiler has to remove the creation of that string is on the basis of the "as if" rule. That is, is the behavior of the string creation/destruction visible to the user and therefore not able to be removed?
Since it uses std::allocator and the standard character traits, the basic_string construction and destruction itself is not being overridden by the user. So there is some basis for the idea that the string's creation is not a visible side-effect of the function call and thus could be removed under the "as if" rule.
However, because std::allocator::allocate is specified to call ::operator new, and operator new is globally replaceable, it is reasonable to argue that this is a visible side effect of the construction of such a string. And therefore, the compiler cannot remove it under the "as if" rule.
If the compiler knows that you have not replaced operator new, then it can in theory optimize the string away.
That doesn't mean that any particular compiler will do so.
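To make the point about replaceable allocation functions concrete, here is a minimal sketch, assuming a program that replaces the global operator new with a counting version. The counter g_new_calls and the helper allocations_during_bar() are hypothetical names introduced for illustration only. The number reported is not something this sketch can promise: it depends on the compiler and standard library, which is exactly the point under discussion.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>
#include <utility>

// Hypothetical counter, introduced only for this illustration.
static std::size_t g_new_calls = 0;

// Replacing the global allocation function is permitted by the standard.
// Every call that actually reaches it becomes observable through the counter.
void* operator new(std::size_t size) {
    ++g_new_calls;
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Same function as in the question.
int bar() {
    std::pair<int, std::string> p{
        123, "Hey... no small-string optimization for me please!"};
    return p.first;
}

// Reports how many times the replaced operator new ran during one bar() call.
// The result is implementation-dependent; it may well be zero.
std::size_t allocations_during_bar() {
    const std::size_t before = g_new_calls;
    (void)bar();
    return g_new_calls - before;
}
```

If allocations_during_bar() returns zero on a given toolchain, the allocation was removed; a non-zero value means the replaced function was actually called.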
Following the discussion in various answers and comments here, I have now filed the following bugs against GCC and LLVM regarding this issue:
GCC bug 94293: [missed optimization] new+delete of unused local string not removed
Minimal testcase (GodBolt):
void foo() {
int *p = new int[1];
*p = 42;
delete[] p;
}
GCC bug 94294: [missed optimization] Useless statements populating local string not removed
Minimal testcase (GodBolt):
void foo() {
std::string s { "This is not a small string" };
}
LLVM bug 45287: [missed optimization] failure to drop unused libstdc++ std::string.
Minimal testcase (GodBolt):
void foo() {
std::string s { "This is not a small string" };
}
Thanks go to: @JeffGarret, @NicolBolas, @Jarod42, Marc Glisse.
Update, August 2021: With recent versions of clang++, g++ and libstdc++, all of these minimal testcases eschew memory allocation, as one would expect. clang++ also has this behavior for OP's program in the question, but GCC still allocates and deallocates.
The question is: can the program
int bar() {
std::pair<int, std::string> p {
123, "Hey... no small-string optimization for me please!" };
return p.first;
}
validly be optimized to
int bar() {
return 123;
}
tldr, yes, I think.
And clang does with libc++: godbolt
About std::string, the standard says in [string.require]/3:
Every object of type basic_string uses an object of type Allocator to allocate and free storage for the contained charT objects as needed.
"as needed". std::string is allowed to decide when to use the allocator (which is I believe the justification for SSO being valid). Its member functions do not mandate an allocation. Therefore the allocation may be elided.
I'm trying to experiment with code by myself here on different compilers.
I've been trying to look up the advantages of disabling exceptions on certain functions (via the binary footprint) and to compare that to functions that don't disable exceptions, and I've actually stumbled onto a weird case where it's better to have exceptions than not.
I've been using Matt Godbolt's Compiler Explorer to do these checks, and it was checked on x86-64 clang 12.0.1 without any flags (on GCC this weird behavior doesn't exist).
Looking at this simple code:
auto* allocated_int()
{
return new int{};
}
int main()
{
delete allocated_int();
return 0;
}
Very straight-forward, pretty much deletes an allocated pointer returned from the function allocated_int().
As expected, the binary footprint is minimal, as well:
allocated_int(): # @allocated_int()
push rbp
mov rbp, rsp
mov edi, 4
call operator new(unsigned long)
mov rcx, rax
mov rax, rcx
mov dword ptr [rcx], 0
pop rbp
ret
Also, very straight-forward.
But the moment I apply the noexcept keyword to the allocated_int() function, the binary bloats. Here is the resulting assembly:
allocated_int(): # @allocated_int()
push rbp
mov rbp, rsp
sub rsp, 16
mov edi, 4
call operator new(unsigned long)
mov rcx, rax
mov qword ptr [rbp - 8], rcx # 8-byte Spill
jmp .LBB0_1
.LBB0_1:
mov rcx, qword ptr [rbp - 8] # 8-byte Reload
mov rax, rcx
mov dword ptr [rcx], 0
add rsp, 16
pop rbp
ret
mov rdi, rax
call __clang_call_terminate
__clang_call_terminate: # @__clang_call_terminate
push rax
call __cxa_begin_catch
call std::terminate()
Why is clang doing this extra code for us? I didn't request any other action but calling new(), and I was expecting the binary to reflect that.
Thank you for those who can explain!
Why is clang doing this extra code for us?
Because the behaviour of the function is different.
I didn't request any other action but calling new()
By declaring the function noexcept, you've requested std::terminate to be called in case an exception propagates out of the function.
allocated_int in the first program never calls std::terminate, while
allocated_int in the second program may call std::terminate. Note that the amount of added code is much less if you remember to enable the optimiser. Comparing non-optimised assembly is mostly futile.
You can use non-throwing allocation to prevent that:
return new(std::nothrow) int{};
It's indeed an astute observation that doing potentially throwing things inside non-throwing function can introduce some extra work that wouldn't need to be done if the same things were done in a potentially throwing function.
I've been trying to lookup the advantages of disabling exceptions on certain functions
The advantage of using non-throwing is potentially realised where such function is called; not within the function itself.
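One well-known place where the caller benefits is std::vector reallocation, which moves existing elements only when their move constructor is noexcept and otherwise falls back to copying. The following is a minimal sketch of that effect; Counters, Tracked and grow_vector are hypothetical names made up for this illustration.

```cpp
#include <vector>

// Hypothetical bookkeeping type for this illustration.
struct Counters {
    int copies = 0;
    int moves = 0;
};

// Element type whose move constructor is noexcept only if NoexceptMove is true.
template <bool NoexceptMove>
struct Tracked {
    Counters* c;
    explicit Tracked(Counters* counters) : c(counters) {}
    Tracked(const Tracked& o) : c(o.c) { ++c->copies; }
    Tracked(Tracked&& o) noexcept(NoexceptMove) : c(o.c) { ++c->moves; }
};

// Appends elements until the vector has reallocated at least once and
// reports how the existing elements were transferred to the new storage.
template <bool NoexceptMove>
Counters grow_vector() {
    Counters counters;
    std::vector<Tracked<NoexceptMove>> v;
    v.emplace_back(&counters);            // constructed in place, no copy or move
    const auto cap = v.capacity();
    while (v.capacity() == cap)           // keep appending until storage grows
        v.emplace_back(&counters);
    return counters;
}
```

With the noexcept move constructor the relocation uses moves; with the potentially throwing one the vector copies instead, so that it can still roll back if an exception is thrown halfway through.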
Without noexcept, your function just acts as a front end to the allocation function you call. It doesn't have any real behavior of its own. In fact, in a real executable, if you do link-time optimization there's a pretty good chance that it'll completely disappear.
When you add noexcept, your code is silently transformed into something roughly like this:
auto* allocated_int()
{
try {
return new int{};
}
catch(...) {
std::terminate();
}
}
The extra code you see generated is what's needed to catch the exception and call terminate when/if needed.
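The three variants discussed in this thread can be compared at compile time with the noexcept operator. This is only a sketch; the function names below are hypothetical variations of allocated_int() from the question.

```cpp
#include <new>

// Plain version: std::bad_alloc may propagate to the caller.
int* allocated_int_throwing() { return new int{}; }

// noexcept version: an escaping std::bad_alloc ends in std::terminate,
// which is what the extra landing-pad code in the assembly is for.
int* allocated_int_noexcept() noexcept { return new int{}; }

// Non-throwing allocation: failure is reported by returning nullptr,
// so there is no exception to intercept in the first place.
int* allocated_int_nothrow() noexcept { return new (std::nothrow) int{}; }

static_assert(!noexcept(allocated_int_throwing()), "may propagate std::bad_alloc");
static_assert(noexcept(allocated_int_noexcept()), "declared noexcept");
static_assert(noexcept(allocated_int_nothrow()), "declared noexcept");
```

A caller of allocated_int_nothrow() has to check the result against nullptr before using it.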
I am curious why the following piece of code:
#include <string>
int main()
{
std::string a = "ABCDEFGHIJKLMNO";
}
when compiled with -O3 yields the following code:
main: # @main
xor eax, eax
ret
(I perfectly understand that there is no need for the unused a so the compiler can entirely omit it from the generated code)
However the following program:
#include <string>
int main()
{
std::string a = "ABCDEFGHIJKLMNOP"; // <-- !!! One Extra P
}
yields:
main: # @main
push rbx
sub rsp, 48
lea rbx, [rsp + 32]
mov qword ptr [rsp + 16], rbx
mov qword ptr [rsp + 8], 16
lea rdi, [rsp + 16]
lea rsi, [rsp + 8]
xor edx, edx
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, unsigned long)
mov qword ptr [rsp + 16], rax
mov rcx, qword ptr [rsp + 8]
mov qword ptr [rsp + 32], rcx
movups xmm0, xmmword ptr [rip + .L.str]
movups xmmword ptr [rax], xmm0
mov qword ptr [rsp + 24], rcx
mov rax, qword ptr [rsp + 16]
mov byte ptr [rax + rcx], 0
mov rdi, qword ptr [rsp + 16]
cmp rdi, rbx
je .LBB0_3
call operator delete(void*)
.LBB0_3:
xor eax, eax
add rsp, 48
pop rbx
ret
mov rdi, rax
call _Unwind_Resume
.L.str:
.asciz "ABCDEFGHIJKLMNOP"
when compiled with the same -O3. I don't understand why it does not recognize that the a is still unused, regardless that the string is one byte longer.
This question is relevant to gcc 9.1 and clang 8.0, (online: https://gcc.godbolt.org/z/p1Z8Ns) because other compilers in my observation either entirely drop the unused variable (ellcc) or generate code for it regardless the length of the string.
This is due to the small string optimization. When the string data is less than or equal to 16 characters, including the null terminator, it is stored in a buffer local to the std::string object itself. Otherwise, it allocates memory on the heap and stores the data over there.
The first string "ABCDEFGHIJKLMNO" plus the null terminator is exactly of size 16. Adding "P" makes it exceed the buffer, hence new is called internally, which may in turn lead to a system call. The compiler can optimize something away if it's possible to ensure that there are no side effects. An opaque allocation call that may end in a system call probably makes it impossible to do this; by contrast, changing a buffer local to the object under construction allows for such a side effect analysis.
Tracing the local buffer in libstdc++, version 9.1, reveals these parts of bits/basic_string.h:
template<typename _CharT, typename _Traits, typename _Alloc>
class basic_string
{
// ...
enum { _S_local_capacity = 15 / sizeof(_CharT) };
union
{
_CharT _M_local_buf[_S_local_capacity + 1];
size_type _M_allocated_capacity;
};
// ...
};
which lets you spot the local buffer size _S_local_capacity and the local buffer itself (_M_local_buf). When the constructor triggers basic_string::_M_construct being called, you have in bits/basic_string.tcc:
void _M_construct(_InIterator __beg, _InIterator __end, ...)
{
size_type __len = 0;
size_type __capacity = size_type(_S_local_capacity);
while (__beg != __end && __len < __capacity)
{
_M_data()[__len++] = *__beg;
++__beg;
}
where the local buffer is filled with its content. Right after this part, we get to the branch where the local capacity is exhausted: new storage is allocated (through the allocate in _M_create), the local buffer is copied into the new storage and filled with the rest of the initializing argument:
while (__beg != __end)
{
if (__len == __capacity)
{
// Allocate more space.
__capacity = __len + 1;
pointer __another = _M_create(__capacity, __len);
this->_S_copy(__another, _M_data(), __len);
_M_dispose();
_M_data(__another);
_M_capacity(__capacity);
}
_M_data()[__len++] = *__beg;
++__beg;
}
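The arithmetic behind _S_local_capacity can be checked in isolation. This is a sketch that only mirrors the expression quoted from libstdc++ 9.1; local_capacity and fits_locally are hypothetical helper names, and other standard libraries use different thresholds.

```cpp
#include <cstddef>

// Mirrors the libstdc++ expression `15 / sizeof(_CharT)` quoted above.
template <typename CharT>
constexpr std::size_t local_capacity() {
    return 15 / sizeof(CharT);
}

// True if a string of `length` characters (terminator not counted)
// fits into the local buffer of _S_local_capacity + 1 elements.
template <typename CharT>
constexpr bool fits_locally(std::size_t length) {
    return length <= local_capacity<CharT>();
}

// "ABCDEFGHIJKLMNO" has 15 characters and still fits;
// "ABCDEFGHIJKLMNOP" has 16 and does not.
static_assert(fits_locally<char>(15), "15 chars plus terminator fill the buffer");
static_assert(!fits_locally<char>(16), "one more char needs heap storage");
```

For wider character types the same expression yields a smaller count, since the buffer size in bytes stays roughly the same.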
As a side note, small string optimization is quite a topic on its own. To get a feeling for how tweaking individual bits can make a difference at large scale, I'd recommend this talk. It also mentions how the std::string implementation that ships with gcc (libstdc++) works and how it changed over time to match newer versions of the standard.
I was surprised the compiler saw through a std::string constructor/destructor pair until I saw your second example. It didn't. What you're seeing here is small string optimization and corresponding optimizations from the compiler around that.
Small string optimizations are when the std::string object itself is big enough to hold the contents of the string, a size and possibly a discriminating bit used to indicate whether the string is operating in small or big string mode. In such a case, no dynamic allocations occur and the string is stored in the std::string object itself.
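Whether a given string currently lives in the small buffer can be probed by checking if its data pointer lies inside the std::string object itself. The helper below is a hypothetical sketch that relies on this implementation detail; the standard does not guarantee any particular answer for short strings.

```cpp
#include <functional>
#include <string>

// Returns true if the character data is stored inside the std::string
// object itself, i.e. the string is in "small" mode on implementations
// that use a small-string buffer.
bool stored_inside_object(const std::string& s) {
    const char* begin = reinterpret_cast<const char*>(&s);
    const char* end = begin + sizeof(std::string);
    // std::less gives a total order even for pointers into different objects.
    const std::less<const char*> before{};
    return !before(s.data(), begin) && before(s.data(), end);
}
```

On libstdc++, stored_inside_object(std::string("ABCDEFGHIJKLMNO")) is expected to be true and the 16-character variant false, but that threshold is a library choice. A string longer than sizeof(std::string) can never be stored inline.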
Compilers are really bad at eliding unneeded allocations and deallocations. These are treated almost as if they had side effects and are thus impossible to elide. When you go over the small string optimization threshold, dynamic allocations occur and the result is what you see.
As an example
void foo() {
delete new int;
}
is the simplest, dumbest allocation/deallocation pair possible, yet gcc emits this assembly even under -O3
sub rsp, 8
mov edi, 4
call operator new(unsigned long)
mov esi, 4
add rsp, 8
mov rdi, rax
jmp operator delete(void*, unsigned long)
While the accepted answer is valid, since C++14 it's actually the case that new and delete calls can be optimized away. See this arcane wording on cppreference:
New-expressions are allowed to elide ... allocations made through replaceable allocation functions. In case of elision, the storage may be provided by the compiler without making the call to an allocation function (this also permits optimizing out unused new-expression).
...
Note that this optimization is only permitted when new-expressions are
used, not any other methods to call a replaceable allocation function:
delete[] new int[10]; can be optimized out, but operator
delete(operator new(10)); cannot.
This actually allows compilers to completely drop your local std::string even if it's very long. In fact, clang++ with libc++ already does this (GodBolt), since libc++ uses the built-ins __builtin_operator_new and __builtin_operator_delete in its implementation of std::string; that's "storage provided by the compiler". Thus, we get:
main():
xor eax, eax
ret
with basically any-length unused string.
GCC doesn't do this yet, but I've recently opened bug reports about it; see this SO answer for links.
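The distinction drawn in the quoted wording can be put side by side. This is a minimal sketch with hypothetical function names; whether a particular compiler actually removes the first allocation is still up to that compiler.

```cpp
#include <new>

// Allocation through a new-expression: since C++14 the implementation
// is allowed to omit the call to the replaceable allocation function.
int value_with_new_expression() {
    int* p = new int(42);
    const int result = *p;
    delete p;
    return result;
}

// Allocation through a direct call to operator new: this is an ordinary
// function call, so the elision rule for new-expressions does not apply.
int value_with_direct_calls() {
    void* raw = ::operator new(sizeof(int));
    int* p = ::new (raw) int(42);   // placement new constructs, does not allocate
    const int result = *p;
    ::operator delete(raw);
    return result;
}
```

Both functions return the same value; the difference only shows up in the generated code.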
Good evening.
I know C-style arrays or std::array aren't faster than vectors. I use vectors all the time (and I use them well). However, I have some situation in which the use of std::array performs better than with std::vector, and I have no clue why (tested with clang 7.0 and gcc 8.2).
Let me share a simple code:
#include <vector>
#include <array>
// some size constant
const size_t N = 100;
// some vectors and arrays
using vec = std::vector<double>;
using arr = std::array<double,3>;
// arrays are constructed faster here due to known size, but it is irrelevant
const vec v1 {1.0,-1.0,1.0};
const vec v2 {1.0,2.0,1.0};
const arr a1 {1.0,-1.0,1.0};
const arr a2 {1.0,2.0,1.0};
// vector to store combinations of vectors or arrays
std::vector<double> glob(N,0.0);
So far, so good. The above code which initializes the variables is not included in the benchmark. Now, let's write a function to combine elements (double) of v1 and v2, or of a1 and a2:
// some combination
auto comb(const double m, const double f)
{
return m + f;
}
And the benchmark functions:
void assemble_vec()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(v1[0],v2[0]);
glob[i+1] += comb(v1[1],v2[1]);
glob[i+2] += comb(v1[2],v2[2]);
}
}
void assemble_arr()
{
for (size_t i=0; i<N-2; ++i)
{
glob[i] += comb(a1[0],a2[0]);
glob[i+1] += comb(a1[1],a2[1]);
glob[i+2] += comb(a1[2],a2[2]);
}
}
I've tried this with clang 7.0 and gcc 8.2. In both cases, the array version goes almost twice as fast as the vector version.
Does anyone know why? Thanks!
GCC (and probably Clang) are optimizing out the Arrays, but not the Vectors
Your base assumption that arrays are necessarily slower than vectors is incorrect. Because vectors require their data to be stored in allocated memory (which with a default allocator uses dynamic memory), the values that need to be used have to be stored in heap memory and accessed repeatedly during the execution of this program. Conversely, the values used by the array can be optimized out entirely and simply directly referenced in the assembly of the program.
Below is what GCC spit out as assembly for the assemble_vec and assemble_arr functions once optimizations were turned on:
[-snip-]
//==============
//Vector Version
//==============
assemble_vec():
mov rax, QWORD PTR glob[rip]
mov rcx, QWORD PTR v2[rip]
mov rdx, QWORD PTR v1[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rsi, [rax+784]
.L23:
movsd xmm2, QWORD PTR [rcx]
addsd xmm2, QWORD PTR [rdx]
add rax, 8
addsd xmm0, xmm2
movsd QWORD PTR [rax-8], xmm0
movsd xmm0, QWORD PTR [rcx+8]
addsd xmm0, QWORD PTR [rdx+8]
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
movsd xmm1, QWORD PTR [rcx+16]
addsd xmm1, QWORD PTR [rdx+16]
addsd xmm1, QWORD PTR [rax+8]
movsd QWORD PTR [rax+8], xmm1
cmp rax, rsi
jne .L23
ret
//=============
//Array Version
//=============
assemble_arr():
mov rax, QWORD PTR glob[rip]
movsd xmm2, QWORD PTR .LC1[rip]
movsd xmm3, QWORD PTR .LC2[rip]
movsd xmm1, QWORD PTR [rax+8]
movsd xmm0, QWORD PTR [rax]
lea rdx, [rax+784]
.L26:
addsd xmm1, xmm3
addsd xmm0, xmm2
add rax, 8
movsd QWORD PTR [rax-8], xmm0
movapd xmm0, xmm1
movsd QWORD PTR [rax], xmm1
movsd xmm1, QWORD PTR [rax+8]
addsd xmm1, xmm2
movsd QWORD PTR [rax+8], xmm1
cmp rax, rdx
jne .L26
ret
[-snip-]
There are several differences between these sections of code, but the critical difference is after the .L23 and .L26 labels respectively, where for the vector version, the numbers are being added together through less efficient opcodes, as compared to the array version, which is using (more) SSE instructions. The vector version also involves more memory lookups compared to the array version. These factors in combination with each other are going to result in code that executes faster for the std::array version of the code than it will for the std::vector version.
C++ aliasing rules don't let the compiler prove that glob[i] += stuff doesn't modify one of the elements of const vec v1 {1.0,-1.0,1.0}; or v2.
const on a std::vector means the "control block" pointers can be assumed to not be modified after it's constructed, but the memory is still dynamically allocated and all the compiler knows is that it effectively has a const double * in static storage.
Nothing in the std::vector implementation lets the compiler rule out some other non-const pointer pointing into that storage. For example, the double *data in the control block of glob.
C++ doesn't provide a way for library implementers to give the compiler the information that the storage for different std::vectors doesn't overlap. They can't use __restrict (even on compilers that support that extension) because that could break programs that take the address of a vector element. See the C99 documentation for restrict.
But with const arr a1 {1.0,-1.0,1.0}; and a2, the doubles themselves can go in read-only static storage, and the compiler knows this. Therefore it can evaluate comb(a1[0],a2[0]); and so on at compile time. In @Xirema's answer, you can see the asm output loads constants .LC1 and .LC2. (Only two constants because both a1[0]+a2[0] and a1[2]+a2[2] are 1.0+1.0. The loop body uses xmm2 as a source operand for addsd twice, and the other constant once.)
But couldn't the compiler still do the sums once outside the loop at runtime?
No, again because of potential aliasing. It doesn't know that stores into glob[i+0..3] won't modify the contents of v1[0..2], so it reloads from v1 and v2 every time through the loop after the store into glob.
(It doesn't have to reload the vector<> control block pointers, though, because type-based strict aliasing rules let it assume that storing a double doesn't modify a double*.)
The compiler could have checked that glob.data() + 0 .. N-3 didn't overlap with either of v1/v2.data() + 0 .. 2, and made a different version of the loop for that case, hoisting the three comb() results out of the loop.
This is a useful optimization that some compilers do when auto-vectorizing if they can't prove lack of aliasing; it's clearly a missed optimization in your case that gcc doesn't check for overlap because it would make the function run much faster. But the question is whether the compiler could reasonably guess that it was worth emitting asm that checks at runtime for overlap, and has 2 different versions of the same loop. With profile-guided optimization, it would know the loop is hot (runs many iterations), and would be worth spending extra time on. But without that, the compiler might not want to risk bloating the code too much.
ICC19 (Intel's compiler) in fact does do something like that here, but it's weird: if you look at the beginning of assemble_vec (on the Godbolt compiler explorer), it loads the data pointer from glob, then adds 8 and subtracts the pointer again, producing a constant 8. Then it branches at runtime on 8 > 784 (not taken) and then -8 < 784 (taken). It looks like this was supposed to be an overlap check, but it maybe used the same pointer twice instead of v1 and v2? (784 = 8*100 - 16 = sizeof(double)*N - 16)
Anyway, it ends up running the ..B2.19 loop that hoists all 3 comb() calculations, and interestingly does 2 iterations at once of the loop with 4 scalar loads and stores to glob[i+0..4], and 6 addsd (scalar double) add instructions.
Elsewhere in the function body, there's a vectorized version that uses 3x addpd (packed double), just storing / reloading 128-bit vectors that partially overlap. This will cause store-forwarding stalls, but out-of-order execution may be able to hide that. It's just really weird that it branches at runtime on a calculation that will produce the same result every time, and never uses that loop. Smells like a bug.
If glob[] had been a static array, you'd still have had a problem. Because the compiler can't know that v1/v2.data() aren't pointing into that static array.
I thought if you accessed it through double *__restrict g = &glob[0];, there wouldn't have been a problem at all. That will promise the compiler that g[i] += ... won't affect any values that you access through other pointers, like v1[0].
In practice, that does not enable hoisting of comb() for gcc, clang, or ICC -O3. But it does for MSVC. (I've read that MSVC doesn't do type-based strict aliasing optimizations, but it's not reloading glob.data() inside the loop so it has somehow figured out that storing a double won't modify a pointer. But MSVC does define the behaviour of *(int*)my_float for type-punning, unlike other C++ implementations.)
For testing, I put this on Godbolt
//__attribute__((noinline))
void assemble_vec()
{
double *__restrict g = &glob[0]; // Helps MSVC, but not gcc/clang/ICC
// std::vector<double> &g = glob; // actually hurts ICC it seems?
// #define g glob // so use this as the alternative to __restrict
for (size_t i=0; i<N-2; ++i)
{
g[i] += comb(v1[0],v2[0]);
g[i+1] += comb(v1[1],v2[1]);
g[i+2] += comb(v1[2],v2[2]);
}
}
We get this from MSVC outside the loop
movsd xmm2, QWORD PTR [rcx] # v2[0]
movsd xmm3, QWORD PTR [rcx+8]
movsd xmm4, QWORD PTR [rcx+16]
addsd xmm2, QWORD PTR [rax] # += v1[0]
addsd xmm3, QWORD PTR [rax+8]
addsd xmm4, QWORD PTR [rax+16]
mov eax, 98 ; 00000062H
Then we get an efficient-looking loop.
So this is a missed-optimization for gcc/clang/ICC.
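Independent of what a given compiler manages, the hoisting can be done by hand: copying the three sums into local variables before the loop removes the aliasing question, because a local whose address is never taken cannot be modified by a store through another pointer. The sketch below passes the vectors as parameters instead of using the globals from the question; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline with the same loop shape as assemble_vec: a[k] and b[k] are
// read inside the loop, so they may have to be reloaded after every store.
void assemble_reloading(std::vector<double>& out,
                        const std::vector<double>& a,
                        const std::vector<double>& b) {
    assert(a.size() >= 3 && b.size() >= 3);
    for (std::size_t i = 0; i + 2 < out.size(); ++i) {
        out[i]     += a[0] + b[0];
        out[i + 1] += a[1] + b[1];
        out[i + 2] += a[2] + b[2];
    }
}

// Manually hoisted: the sums are computed once into locals.
// Note: this only matches the baseline if out does not alias a or b.
void assemble_hoisted(std::vector<double>& out,
                      const std::vector<double>& a,
                      const std::vector<double>& b) {
    assert(a.size() >= 3 && b.size() >= 3);
    const double c0 = a[0] + b[0];
    const double c1 = a[1] + b[1];
    const double c2 = a[2] + b[2];
    for (std::size_t i = 0; i + 2 < out.size(); ++i) {
        out[i]     += c0;
        out[i + 1] += c1;
        out[i + 2] += c2;
    }
}
```

The caveat in the second comment is the same one the compiler faces: the two versions differ in behaviour exactly when the output overlaps one of the inputs.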
I think the point is that your storage is very small (six doubles). In the std::array case, this lets the compiler avoid keeping the values in RAM altogether by placing them in registers; the compiler can keep stack variables in registers if that is more optimal. This decreases memory accesses by half (only writing to glob remains). In the case of a std::vector, the compiler cannot perform such an optimization since dynamic memory is used. Try to use significantly larger sizes for a1, a2, v1, v2.
Is a reference compiled as a usual pointer, or is there other stuff behind it?
And how does it differ in clang?
You can think of a reference as an immutable pointer that is automatically de-referenced on usage. This isn't what the C++ standard says, so you cannot rely on that being an actual implementation.
Practically speaking though, it is likely to be what you see in many cases.
Take the following example in the case of parameter passing:
#include <stdio.h>
void function (int *const n){
printf("%d",*n);
}
void function (int & n){
printf("%d",n);
}
int main(){
int n = 123;
function(&n);
function(n);
}
Both gcc and clang produce identical code for the functions without any optimizations enabled:
function(int*):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
function(int&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
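The same equivalence can be observed from inside the language: taking the address of a reference yields the address of the object it refers to, just as a pointer to that object would hold. A small sketch, with hypothetical function names:

```cpp
// Reads an int through a pointer parameter.
int read_through_pointer(const int* p) { return *p; }

// Reads an int through a reference parameter; the dereference is implicit.
int read_through_reference(const int& r) { return r; }

// Returns the address of the object the reference is bound to.
const int* address_behind(const int& r) { return &r; }
```

For an int n, address_behind(n) compares equal to &n, and both read functions return the value of n.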
How does a reference translate to asm in gcc?
In general: It depends.
To find that out in a specific case, you can test by reading the generated assembly.
Is a reference compiled as a usual pointer, or is there other stuff behind it?
The implementation of code using a reference is practically identical to one using a pointer to achieve identical indirection. How each is implemented is not guaranteed by the standard, but there is no need for implementing them differently.
References only differ from pointers in the ways they are allowed to be used by the rules of C++. Of course, because the rules are different, pointers can be used in ways that references cannot. And in such a case you cannot compare whether pointers generate the same assembly.
The limitations of a reference might make some optimizations easier, so there might be a difference, but such optimization would also have been possible with pointers, so there is no guarantee of different assembly output when using references instead of pointers.
And how does it differ in clang?
In general: It depends.
Both compilers are bound by the same rules of the standard. They might generate identical assembly or different. How the generated assembly of particular version of one compiler differs (if at all) from the assembly generated by a particular version of another compiler, with particular compilation options for each, on particular processor architecture on particular operating system, in a particular use case of a reference, can be found by inspecting and comparing the generated assembly in each particular case.
Let's mess around with very basic dynamically allocated memory. We take a vector of 3, set its elements and return the sum of the vector.
In the first test case I used a raw pointer with new[]/delete[]. In the second I used std::vector:
#include <vector>
int main()
{
//int *v = new int[3]; // (1)
auto v = std::vector<int>(3); // (2)
for (int i = 0; i < 3; ++i)
v[i] = i + 1;
int s = 0;
for (int i = 0; i < 3; ++i)
s += v[i];
//delete[] v; // (1)
return s;
}
Assembly of (1) (new[]/delete[])
main: # @main
mov eax, 6
ret
Assembly of (2) (std::vector)
main: # @main
push rax
mov edi, 12
call operator new(unsigned long)
mov qword ptr [rax], 0
movabs rcx, 8589934593
mov qword ptr [rax], rcx
mov dword ptr [rax + 8], 3
test rax, rax
je .LBB0_2
mov rdi, rax
call operator delete(void*)
.LBB0_2: # %std::vector<int, std::allocator<int> >::~vector() [clone .exit]
mov eax, 6
pop rdx
ret
Both outputs taken from https://gcc.godbolt.org/ with -std=c++14 -O3
In both versions the returned value is computed at compile time so we see just mov eax, 6; ret.
With the raw new[]/delete[] the dynamic allocation was completely removed. With std::vector however, the memory is allocated, set and freed.
This happens even with an unused variable auto v = std::vector<int>(3): call to new, memory is set and then call to delete.
I realize this is most likely a near impossible answer to give, but maybe someone has some insights and some interesting answers might pop out.
What are the contributing factors that don't allow compiler optimizations to remove the memory allocation in the std::vector case, like in the raw memory allocation case?
When using a pointer to a dynamically allocated array (directly using new[] and delete[]), the compiler optimized away the calls to operator new and operator delete even though they have observable side effects. This optimization is allowed by the C++ standard section 5.3.4 paragraph 10:
An implementation is allowed to omit a call to a replaceable global
allocation function (18.6.1.1, 18.6.1.2). When it does so, the storage
is instead provided by the implementation or...
I'll show the rest of the sentence, which is crucial, at the end.
This optimization is relatively new because it was first allowed in C++14 (proposal N3664). Clang supported it since 3.4. The latest version of gcc, namely 5.3.0, doesn't take advantage of this relaxation of the as-if rule. It produces the following code:
main:
sub rsp, 8
mov edi, 12
call operator new[](unsigned long)
mov DWORD PTR [rax], 1
mov DWORD PTR [rax+4], 2
mov rdi, rax
mov DWORD PTR [rax+8], 3
call operator delete[](void*)
mov eax, 6
add rsp, 8
ret
MSVC 2013 also doesn't support this optimization. It produces the following code:
main:
sub rsp,28h
mov ecx,0Ch
call operator new[] ()
mov rcx,rax
mov dword ptr [rax],1
mov dword ptr [rax+4],2
mov dword ptr [rax+8],3
call operator delete[] ()
mov eax,6
add rsp,28h
ret
I currently don't have access to MSVC 2015 Update 1 and therefore I don't know whether it supports this optimization or not.
Finally, here is the assembly code generated by icc 13.0.1:
main:
push rbp
mov rbp, rsp
and rsp, -128
sub rsp, 128
mov edi, 3
call __intel_new_proc_init
stmxcsr DWORD PTR [rsp]
mov edi, 12
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
call operator new[](unsigned long)
mov rdi, rax
mov DWORD PTR [rax], 1
mov DWORD PTR [4+rax], 2
mov DWORD PTR [8+rax], 3
call operator delete[](void*)
mov eax, 6
mov rsp, rbp
pop rbp
ret
Clearly, it doesn't support this optimization. I don't have access to the latest version of icc, namely 16.0.
All of these code snippets have been produced with optimizations enabled.
When using std::vector, none of these compilers optimized away the allocation. When a compiler doesn't perform an optimization, it's either because it cannot for some reason or it's just not yet supported.
What are the contributing factors that don't allow compiler
optimizations to remove the memory allocation in the std::vector case,
like in the raw memory allocation case?
The compiler didn't perform the optimization because it's not allowed to. To see this, let's see the rest of the sentence of paragraph 10 from 5.3.4:
An implementation is allowed to omit a call to a replaceable global
allocation function (18.6.1.1, 18.6.1.2). When it does so, the storage
is instead provided by the implementation or provided by extending the
allocation of another new-expression.
What this is saying is that you can omit a call to a replaceable global allocation function only if it originated from a new-expression. A new-expression is defined in paragraph 1 of the same section.
The following expression
new int[3]
is a new-expression and therefore the compiler is allowed to optimize away the associated allocation function call.
On the other hand, the following expression:
::operator new(12)
is NOT a new-expression (see 5.3.4 paragraph 1). This is just a function call expression. In other words, this is treated as a typical function call. This function cannot be optimized away because it's imported from another shared library (even if you linked the runtime statically, the function itself calls another imported function).
The default allocator used by std::vector allocates memory using ::operator new and therefore the compiler is not allowed to optimize it away.
Let's test this. Here's the code:
int main()
{
int *v = (int*)::operator new(12);
for (int i = 0; i < 3; ++i)
v[i] = i + 1;
int s = 0;
for (int i = 0; i < 3; ++i)
s += v[i];
delete v;
return s;
}
By compiling using Clang 3.7, we get the following assembly code:
main: # @main
push rax
mov edi, 12
call operator new(unsigned long)
movabs rcx, 8589934593
mov qword ptr [rax], rcx
mov dword ptr [rax + 8], 3
test rax, rax
je .LBB0_2
mov rdi, rax
call operator delete(void*)
.LBB0_2:
mov eax, 6
pop rdx
ret
This is exactly the same as assembly code generated when using std::vector except for mov qword ptr [rax], 0 which comes from the constructor of std::vector (the compiler should have removed it but failed to do so because of a flaw in its optimization algorithms).