Memory allocation optimized away by compilers - C++

In his talk "Efficiency with Algorithms, Performance with Data Structures", Chandler Carruth talks about the need for a better allocator model in C++. The current allocator model invades the type system, which makes it almost impossible to use in many projects. The Bloomberg allocator model, on the other hand, does not invade the type system, but it is based on virtual function calls, which makes it impossible for the compiler to 'see' the allocation and optimize it. In his talk he mentions compilers deduplicating memory allocations (1:06:47).
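To make the contrast concrete, here is a minimal sketch of the two styles as I understand them (the names are mine, not Bloomberg's actual interface): an allocation made through a virtual call is opaque to the optimizer, while a plain new/delete pair is something the compiler is explicitly allowed to remove.

#include <cstddef>
#include <cstdlib>

// Hypothetical illustration only, not Bloomberg's real interface.
struct AbstractAllocator {
    virtual void* allocate(std::size_t n) = 0;
    virtual void deallocate(void* p) = 0;
    virtual ~AbstractAllocator() = default;
};

struct MallocAllocator : AbstractAllocator {
    void* allocate(std::size_t n) override { return std::malloc(n); }
    void deallocate(void* p) override { std::free(p); }
};

int use_virtual(AbstractAllocator& a) {
    void* p = a.allocate(sizeof(int)); // opaque virtual call: cannot be elided
    int ok = (p != nullptr);
    a.deallocate(p);
    return ok;
}

int use_new() {
    int* p = new int();                // visible allocation: may be removed entirely
    int ok = (p != nullptr);
    delete p;
    return ok;
}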
It took me some time to find examples of this memory allocation optimization, but I have found this code sample which, compiled under Clang, optimizes away all the memory allocation and just returns 1000000 without allocating anything:
template<typename T>
T* create() { return new T(); }

int main() {
    auto result = 0;
    for (auto i = 0; i < 1000000; ++i) {
        result += (create<int>() != nullptr);
    }
    return result;
}
The paper N3664 (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3664.html) also says that allocations could be fused by compilers, and it seems to suggest that some compilers already do that sort of thing.
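My reading of N3664 (this is a sketch of what "fusing" could mean, not something every compiler is guaranteed to do) is that the two allocations below may legally be merged into one larger allocation, or removed altogether, because the pointers never escape:

#include <cstdio>

int main() {
    int* a = new int(1);   // N3664 lets the implementation extend this allocation
    int* b = new int(2);   // ...to also cover this one (fusing), or elide both
    int sum = *a + *b;
    delete b;
    delete a;
    std::printf("%d\n", sum);  // may compile down to printing the constant 3
    return 0;
}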
As I am very interested in strategies for efficient memory allocation, I really want to understand why Chandler Carruth is against virtual calls in the Bloomberg model. The example above clearly shows that Clang optimizes allocations away when it can see them.
1. I would like to see some "real life" code where this optimization is useful and is done by a current compiler.
2. Do you have an example of code where different allocations are fused by any current compiler?
3. Do you understand what Chandler Carruth means when he says that compilers can "deduplicate" your allocations in his talk at 1:06:47?

I have found this amazing example, which answers the first point of the initial question. Points 2 and 3 do not have an answer yet:
#include <iostream>
#include <vector>
#include <chrono>

std::vector<double> f_val(std::size_t i, std::size_t n) {
    auto v = std::vector<double>( n );
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
    return v;
}

void f_ref(std::size_t i, std::vector<double>& v) {
    for (std::size_t k = 0; k < v.size(); ++k) {
        v[k] = static_cast<double>(k + i);
    }
}

int main (int argc, char const *argv[]) {
    const auto n = std::size_t{10};
    const auto nb_loops = std::size_t{300000000};

    // Begin: Zone 1
    {
        auto v = std::vector<double>( n, 0.0 );
        auto start_time = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < nb_loops; ++i) {
            auto w = f_val(i, n);
            for (std::size_t k = 0; k < v.size(); ++k) {
                v[k] += w[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
        std::cout << time << std::endl;
        std::cout << v[0] << " " << v[n - 1] << std::endl;
    }
    // End: Zone 1

    {
        auto v = std::vector<double>( n, 0.0 );
        auto w = std::vector<double>( n );
        auto start_time = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < nb_loops; ++i) {
            f_ref(i, w);
            for (std::size_t k = 0; k < v.size(); ++k) {
                v[k] += w[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count();
        std::cout << time << std::endl;
        std::cout << v[0] << " " << v[n - 1] << std::endl;
    }

    return 0;
}
Not a single memory allocation happens in the for loop that calls f_val. This only happens with Clang, though (GCC and icpc both fail to do it), and when building a slightly more complicated example the optimization is not done.
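One quick (and admittedly imperfect) way to double-check this is to count calls to the global operator new; note that providing a replacement operator new may itself inhibit the optimization, so the more direct check is to look for calls to operator new in the generated assembly, e.g. on Compiler Explorer. A sketch:

#include <cstdio>
#include <cstdlib>
#include <new>
#include <vector>

// Count every call to the replaceable global operator new. If the allocations
// in the f_val loop are elided, this counter stays (nearly) flat; but beware
// that defining this replacement can itself prevent the elision.
static std::size_t g_news = 0;

void* operator new(std::size_t n) {
    ++g_news;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

std::vector<double> f_val(std::size_t i, std::size_t n) {
    auto v = std::vector<double>(n);
    for (std::size_t k = 0; k < v.size(); ++k) v[k] = static_cast<double>(k + i);
    return v;
}

int main() {
    double total = 0.0;
    for (std::size_t i = 0; i < 1000; ++i) {
        auto w = f_val(i, 10);
        total += w[0] + w[9];
    }
    std::printf("sum = %f, operator new calls = %zu\n", total, g_news);
    return 0;
}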

Related

Why does a slower function run faster if surrounded by other functions?

Just a little C++ code; I confirmed the behavior in Java as well.
This is example code that reproduces this behavior, compiled with Visual Studio 2019, Release, x64. I got:
611 ms for just incrementing elements.
631 ms for incrementing elements with the cache, so an additional 20 ms of overhead.
But when I add a heavy operation before each increment (I chose random number generation), I get:
2073 ms for just incrementing elements.
1432 ms for incrementing elements using the cache.
I have an Intel 10700K CPU and 3200 MHz RAM, if it matters.
#include <iostream>
#include <random>
#include <chrono>
#include <cstdlib>

#define ARR_SIZE 256 * 256 * 256
#define ACCESS_SIZE 256 * 256
#define CACHE_SIZE 1024
#define ITERATIONS 1000

using namespace std;
using chrono::high_resolution_clock;
using chrono::duration_cast;
using chrono::milliseconds;

int* arr;
int* cache;
int counter = 0;

void flushCache() {
    for (int j = 0; j < CACHE_SIZE; ++j)
    {
        ++arr[cache[j]];
    }
    counter = 0;
}

void incWithCache(int i) {
    cache[counter] = i;
    ++counter;
    if (counter == CACHE_SIZE) {
        flushCache();
    }
}

void incWithoutCache(int i) {
    ++arr[i];
}

int heavyOp() {
    return rand() % 107;
}

int main()
{
    arr = new int[ARR_SIZE];
    cache = new int[CACHE_SIZE];
    int* access = new int[ACCESS_SIZE];

    random_device rd;
    mt19937 gen(rd());
    for (int i = 0; i < ACCESS_SIZE; ++i) {
        access[i] = gen() % (ARR_SIZE);
    }
    for (int i = 0; i < ARR_SIZE; ++i) {
        arr[i] = 0;
    }

    auto t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            incWithoutCache(access[i]);
        }
    }
    auto t2 = high_resolution_clock::now();
    auto ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time without cache " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            incWithCache(access[i]);
        }
        flushCache();
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time with cache " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            heavyOp();
            incWithoutCache(access[i]);
        }
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time without cache and time between " << ms_int.count() << "ms\n";

    t1 = high_resolution_clock::now();
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (int i = 0; i < ACCESS_SIZE; ++i) {
            heavyOp();
            incWithCache(access[i]);
        }
        flushCache();
    }
    t2 = high_resolution_clock::now();
    ms_int = duration_cast<milliseconds>(t2 - t1);
    cout << "Time with cache and time between " << ms_int.count() << "ms\n";

    return 0;
}
I think these kinds of questions are extremely hard to answer - optimizing compilers, instruction reordering, and caching all make this hard to analyze - but I do have a hypothesis.
First, the difference between incWithoutCache and incWithCache without heavyOp seems reasonable - the second one is simply doing more work.
When you introduce heavyOp is where it gets interesting.
heavyOp + incWithoutCache: incWithoutCache requires a fetch from memory to output to arr. When that memory fetch is complete, it can do the addition. The processor may begin the next heavyOp operation before the increment is complete because of pipelining.
heavyOp + incWithCache: incWithCache does not require a fetch from memory in every iteration, as it only has to write out a value. The processor can queue that write to a memory controller and continue. It does do ++counter, but in that case you are always accessing the same value, so I assume this can be cached better than ++arr[i] in incWithoutCache. When the cache is flushed, the flushing loop can be heavily pipelined - each iteration of the flushing loop is independent, so many iterations will be in flight at a time.
So I think the big difference here is that the actual writes to arr cannot be as efficiently pipelined without the cache because heavyOp is trashing your pipeline and potentially your cache. Your heavyOp is taking the same amount of time in either case, but in heavyOp + incWithoutCache the amortized cost of a write to arr is higher because it is not overlapped with other writes to arr such as what can occur with heavyOp + incWithCache.
I think vectorization could theoretically be used for the flushing operation, but I did not see it happening on Compiler Explorer, so that may not be the cause of the discrepancy; if vectorization were being used, it could explain part of the speed difference.
I will say I am not an expert in this and could easily be completely wrong about all of this... but it makes sense to me.
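One small hygiene tweak I would try before leaning on these numbers too hard (a sketch applied to the third timing loop, not a claim about what MSVC actually does here): consume heavyOp's result through a volatile so the compiler cannot drop or reorder it, and then check whether the gap persists.

// Sketch: rand() has internal state, so it is unlikely to be removed, but
// consuming the value removes any doubt about what the loop measures.
volatile long long sink = 0;

for (int iter = 0; iter < ITERATIONS; ++iter) {
    for (int i = 0; i < ACCESS_SIZE; ++i) {
        sink += heavyOp();          // result is now observable
        incWithoutCache(access[i]);
    }
}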

memcpy beats SIMD intrinsics

I have been looking at fast ways to copy various amounts of data, when NEON vector instructions are available on an ARM device.
I've done some benchmarks, and have some interesting results. I'm trying to understand what I'm looking at.
I have got four versions to copy data:
1. Baseline
Copies element by element:
for (int i = 0; i < size; ++i)
{
    copy[i] = orig[i];
}
2. NEON
This code loads four values into a temporary register, then copies the register to the output.
Thus the number of loads is reduced by half. There may be a way to skip the temporary register and reduce the loads by one quarter, but I haven't found a way.
int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
    tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
    vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
}
3. Stepped memcpy
Uses the memcpy, but copies 4 elements at a time. This is to compare against the NEON version.
for (int i = 0; i < size; i += 4)
{
    memcpy(orig + i, copy3 + i, 4);
}
4. Normal memcpy
Uses memcpy with full amount of data.
memcpy(orig, copy4, size);
My benchmark using 2^16 values gave some surprising results:
1. Baseline time = 3443[µs]
2. NEON time = 1682[µs]
3. memcpy (stepped) time = 1445[µs]
4. memcpy time = 81[µs]
The speedup for the NEON version is expected; however, the faster stepped memcpy time is surprising to me, and the time for version 4 even more so.
Why is memcpy doing so well? Does it use NEON under the hood? Or are there efficient memory copy instructions I am not aware of?
This question discussed NEON versus memcpy(). However, I don't feel the answers sufficiently explore why the ARM memcpy implementation runs so well.
The full code listing is below:
#include <arm_neon.h>
#include <vector>
#include <cinttypes>
#include <iostream>
#include <cstdlib>
#include <chrono>
#include <cstring>

int main(int argc, char *argv[]) {
    int arr_size;
    if (argc == 1)
    {
        std::cout << "Please enter an array size" << std::endl;
        exit(1);
    }
    int size = atoi(argv[1]); // not very C++, sorry

    std::int32_t* orig = new std::int32_t[size];
    std::int32_t* copy = new std::int32_t[size];
    std::int32_t* copy2 = new std::int32_t[size];
    std::int32_t* copy3 = new std::int32_t[size];
    std::int32_t* copy4 = new std::int32_t[size];

    // Non-neon version
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; ++i)
    {
        copy[i] = orig[i];
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Baseline time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    // NEON version
    begin = std::chrono::steady_clock::now();
    int32x4_t tmp;
    for (int i = 0; i < size; i += 4)
    {
        tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
        vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
    }
    end = std::chrono::steady_clock::now();
    std::cout << "NEON time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    // Memcpy example
    begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; i += 4)
    {
        memcpy(orig + i, copy3 + i, 4);
    }
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    // Memcpy example
    begin = std::chrono::steady_clock::now();
    memcpy(orig, copy4, size);
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    return 0;
}
Note: this code uses memcpy in the wrong direction; it should be memcpy(dest, src, num_bytes). Also, the third argument is a byte count, not an element count, so the stepped loop copies only 4 bytes (one int32_t) per iteration and the final memcpy copies only size bytes, a quarter of the array.
Because the "normal memcpy" test happens last, the massive order of magnitude speedup vs. other tests would be explained by dead code elimination. The optimizer saw that orig is not used after the last memcpy, so it eliminated the memcpy.
A good way to write reliable benchmarks is with the Google Benchmark framework, using its benchmark::DoNotOptimize(x) function to prevent dead code elimination.
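For example, here is a sketch of the "normal memcpy" case rewritten with Google Benchmark (this assumes the library is installed and the binary is linked against it); DoNotOptimize and ClobberMemory keep the copy observable so the optimizer cannot delete it:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstring>
#include <vector>

static void BM_memcpy(benchmark::State& state) {
    std::vector<std::int32_t> src(1 << 16, 42);
    std::vector<std::int32_t> dst(1 << 16);
    for (auto _ : state) {
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(std::int32_t));
        benchmark::DoNotOptimize(dst.data());  // keep the destination "used"
        benchmark::ClobberMemory();            // force the writes to be treated as visible
    }
}
BENCHMARK(BM_memcpy);
BENCHMARK_MAIN();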

Copying a local array is faster than an array from arguments in C++?

While optimizing some code I discovered some things that I didn't expect.
I wrote a simple piece of code to illustrate what I found below:
#include <string.h>
#include <chrono>
#include <iostream>

using namespace std;

int globalArr[1024][1024];

void initArr(int arr[1024][1024])
{
    memset(arr, 0, 1024 * 1024 * sizeof(int));
}

void run()
{
    int arr[1024][1024];
    initArr(arr);
    for (int i = 0; i < 1024; ++i)
    {
        for (int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }
    }
}

void run2(int arr[1024][1024])
{
    initArr(arr);
    for (int i = 0; i < 1024; ++i)
    {
        for (int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }
    }
}

int main()
{
    {
        auto start = chrono::high_resolution_clock::now();
        for (int i = 0; i < 256; ++i)
        {
            run();
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
    }
    {
        auto start = chrono::high_resolution_clock::now();
        for (int i = 0; i < 256; ++i)
        {
            int arr[1024][1024];
            run2(arr);
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run2) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
    }
    return 0;
}
I built the code with g++ version 6.4.0 20180424 and the -O3 flag.
Below are the results running on a Ryzen 1700:
(run) Total time: 43493 microseconds
(run2) Total time: 134740 microseconds
I tried to look at the assembly on godbolt.org (code split across two URLs):
https://godbolt.org/g/aKSHH6
https://godbolt.org/g/zfK14x
But I still don't understand what actually made the difference.
So my questions are:
1. What's causing the performance difference?
2. Is it possible passing array in argument with the same performance as local array?
Edit:
Just some extra info: below are the results built with -O2.
(run) Total time: 94461 microseconds
(run2) Total time: 172352 microseconds
Edit again:
Following xaxxon's comment, I tried removing the initArr call from both functions, and then run2 is actually better than run:
(run) Total time: 45151 microseconds
(run2) Total time: 35845 microseconds
But I still don't understand the reason.
What's causing the performance difference?
The compiler has to generate code for run2 that will continue to work correctly if you call
run2(globalArr);
or (worse), pass in some overlapping but non-identical address.
If you allow your C++ compiler to inline the call, and it chooses to do so, it'll be able to generate inlined code that knows whether the parameter really aliases your global. The out-of-line codegen still has to be conservative though.
Is it possible passing array in argument with the same performance as local array?
You can certainly fix the aliasing problem in C, using the restrict keyword, adapted here to the 1024x1024 arrays from the question:
void run2(int (* restrict arr)[1024])
{
    int (* restrict g)[1024] = globalArr;
    for (int i = 0; i < 1024; ++i)
    {
        for (int j = 0; j < 1024; ++j)
        {
            g[i][j] = arr[i][j];
        }
    }
}
(or probably in C++ using the non-standard extension __restrict).
This should allow the optimizer as much freedom as it had in your original run - unless it's smart enough to elide the local entirely and simply set the global to zero.
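For completeness, here is a sketch of the C++ variant using the non-standard __restrict extension (supported by GCC, Clang and MSVC), again using the question's 1024x1024 arrays and keeping the initArr call:

// Sketch: __restrict tells the compiler that arr and globalArr do not alias,
// so the out-of-line code for run2 can be optimized much like run.
void run2(int (*__restrict arr)[1024])
{
    initArr(arr);
    int (*__restrict g)[1024] = globalArr;
    for (int i = 0; i < 1024; ++i)
    {
        for (int j = 0; j < 1024; ++j)
        {
            g[i][j] = arr[i][j];
        }
    }
}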

C++ map: Is it faster to create a temporary variable or access the value with the key each time

Which is faster:
(*connectionsById)[connection->nextConnectionId].reachable = calculationId;
(*connectionsById)[connection->nextConnectionId].numBoardings = connection->numBoardings;
(*connectionsById)[connection->nextConnectionId].journeySteps = connection->journeySteps;
Or:
Connection& tmpConnection = (*connectionsById)[connection->nextConnectionId];
tmpConnection.reachable = calculationId;
tmpConnection.numBoardings = connection->numBoardings;
tmpConnection.journeySteps = connection->journeySteps;
Will the compiler figure it out anyway?
(I am a beginner in C++.)
These are not equal:
(*connectionsById)[connection->nextConnectionId].reachable = calculationId;
(*connectionsById)[connection->nextConnectionId].numBoardings = connection->numBoardings;
(*connectionsById)[connection->nextConnectionId].journeySteps = connection->journeySteps;
This will dereference connectionsById 3 times and set its member variables accordingly.
Connection tmpConnection = (*connectionsById)[connection->nextConnectionId];
tmpConnection.reachable = calculationId;
tmpConnection.numBoardings = connection->numBoardings;
tmpConnection.journeySteps = connection->journeySteps;
This will create a new object tmpConnection, copied from whatever object connectionsById is pointing to. Then it will modify the copy and not the original object.
If you want to modify the original object you should use a reference (using &):
Connection& tmpConnection = (*connectionsById)[connection->nextConnectionId];
tmpConnection.reachable = calculationId;
tmpConnection.numBoardings = connection->numBoardings;
tmpConnection.journeySteps = connection->journeySteps;
Now, asking the same question about the reworked code snippet: both are similar/equal under the hood, and thus the performance difference will be minimal/none.
The second one will be faster, but tmpConnection needs to be a reference if you actually want to change the values in the map instead of just that copy.
If you did use a reference (Connection& tmpConnection = ...), I doubt that the compiler would be smart enough to make the first code as fast as the second. The first code performs 3 searches by key in the map, while the second one performs the search once and caches the result. The compiler would have to assume a lot of things to make that kind of optimization. (For instance, you can use a map with your own comparison function for ordering and key equality. There is nothing stopping you from changing how that function behaves between the different searches; therefore, the compiler cannot assume the searches will achieve the same results.)
However, if the map is relatively small, this difference may be negligible and writing clearer code may be of more importance.
Edit:
Here is an exaggerated test that shows the difference:
#include <iostream>
#include <map>
#include <chrono>

using namespace std;

int main()
{
    int niter = 100000;
    int nsearch = 1000;

    map<int, size_t> m = {{0,0}, {1,0}, {2,0}, {3,0}, {4,0}, {5,0}, {6,0}, {7,0}, {8,0}, {9,0}};
    auto t1 = chrono::steady_clock::now();
    for (size_t iter = 0; iter < niter; ++iter) {
        for (size_t i = 0; i < m.size(); ++i) {
            for (int k = 0; k < nsearch; ++k) {
                m[i] += k;
            }
        }
    }
    auto t2 = chrono::steady_clock::now();

    map<int, size_t> m2 = {{0,0}, {1,0}, {2,0}, {3,0}, {4,0}, {5,0}, {6,0}, {7,0}, {8,0}, {9,0}};
    auto t3 = chrono::steady_clock::now();
    for (size_t iter = 0; iter < niter; ++iter) {
        for (size_t i = 0; i < m.size(); ++i) {
            size_t& val = m2[i];
            for (int k = 0; k < nsearch; ++k) {
                val += k;
            }
        }
    }
    auto t4 = chrono::steady_clock::now();

    cout << "time 1: " << chrono::duration_cast<chrono::milliseconds>(t2 - t1).count() << " ms" << endl;
    cout << "time 2: " << chrono::duration_cast<chrono::milliseconds>(t4 - t3).count() << " ms" << endl;
    return 0;
}
Compiling this with -O3 with clang, I get these results:
time 1: 2915 ms
time 2: 3 ms
Again, for most cases, it is probably negligible, but caching the key lookup can have a big effect in some cases.
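Applied to the snippet from the question (a sketch using the asker's names; I am assuming connectionsById points at a std::map-like container), you can also do the single lookup with find(), which, unlike operator[], will not insert a default-constructed Connection when the key is missing:

auto it = connectionsById->find(connection->nextConnectionId);
if (it != connectionsById->end()) {
    Connection& c = it->second;        // one search, reused for every member
    c.reachable    = calculationId;
    c.numBoardings = connection->numBoardings;
    c.journeySteps = connection->journeySteps;
}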
It's very hard to say what an optimiser will do, so normally the rule is just to write the code for clarity. If you're in an inner loop, caching the structure may help, but you need to time it to be sure.

Why no significant performance differences for this code with different param passing strategies?

I am trying to write some code to convince myself that pass by value and pass by reference (rvalue and lvalue reference) should have a significant impact on performance (related question). Later I came up with the code below, and I thought the performance differences would be visible.
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>

#define DurationTy std::chrono::duration_cast<std::chrono::milliseconds>
typedef std::vector<int> VectTy;

size_t const MAX = 10000u;
size_t const NUM = MAX / 10;

int randomize(int mod) { return std::rand() % mod; }

VectTy factory(size_t size, bool pos) {
    VectTy vect;
    if (pos) {
        for (size_t i = 0u; i < size; i++) {
            // vect.push_back(randomize(size));
            vect.push_back(i);
        }
    } else {
        for (size_t i = 0u; i < size * 2; i++) {
            vect.push_back(i);
            // vect.push_back(randomize(size));
        }
    }
    return vect;
}

long d1(VectTy vect) {
    long sum = 0;
    for (auto& v : vect) sum += v;
    return sum;
}

long d2(VectTy& vect) {
    long sum = 0;
    for (auto& v : vect) sum += v;
    return sum;
}

long d3(VectTy&& vect) {
    long sum = 0;
    for (auto& v : vect) sum += v;
    return sum;
}

int main(void) {
    {
        auto start = std::chrono::steady_clock::now();
        long total = 0;
        for (size_t i = 0; i < NUM; ++i) {
            total += d1(factory(MAX, i % 2)); // T1
        }
        auto end = std::chrono::steady_clock::now();
        std::cout << total << std::endl;
        auto elapsed = DurationTy(end - start);
        std::cerr << elapsed.count() << std::endl;
    }
    {
        auto start = std::chrono::steady_clock::now();
        long total = 0;
        for (size_t i = 0; i < NUM; ++i) {
            VectTy vect = factory(MAX, i % 2); // T2
            total += d1(vect);
        }
        auto end = std::chrono::steady_clock::now();
        std::cout << total << std::endl;
        auto elapsed = DurationTy(end - start);
        std::cerr << elapsed.count() << std::endl;
    }
    {
        auto start = std::chrono::steady_clock::now();
        long total = 0;
        for (size_t i = 0; i < NUM; ++i) {
            VectTy vect = factory(MAX, i % 2); // T3
            total += d2(vect);
        }
        auto end = std::chrono::steady_clock::now();
        std::cout << total << std::endl;
        auto elapsed = DurationTy(end - start);
        std::cerr << elapsed.count() << std::endl;
    }
    {
        auto start = std::chrono::steady_clock::now();
        long total = 0;
        for (size_t i = 0; i < NUM; ++i) {
            total += d3(factory(MAX, i % 2)); // T4
        }
        auto end = std::chrono::steady_clock::now();
        std::cout << total << std::endl;
        auto elapsed = DurationTy(end - start);
        std::cerr << elapsed.count() << std::endl;
    }
    return 0;
}
I tested it with both GCC (4.9.2) and Clang (trunk), with the -std=c++11 option.
However, I found that only when compiling with Clang does T2 take more time (for one run, in milliseconds: 755, 924, 752, 750). I also compiled a -fno-elide-constructors version, with similar results.
(Update: there are slight performance differences for T1, T3 and T4 when compiled with Clang (trunk) on Mac OS X.)
My questions:
1. What are the optimizations applied that bridge the potential performance gaps between T1, T2 and T3 in theory? (You can see that I also tried to avoid RVO in factory.)
2. What is the possible optimization applied to T2 by GCC in this case?
This is because of rvalue references: you are passing std::vector by value, which the compiler figures out has a move constructor, so it optimizes the copy into a move.
See the following link for details about rvalue references: http://thbecker.net/articles/rvalue_references/section_01.html
Update:
The following three methods turn out to be equivalent.
Here, you are passing the return value of factory directly to d1. The compiler knows that the returned value is a temporary and that std::vector (VectTy) has a move constructor defined, so it just calls that move constructor (which makes this function equivalent to d3):
long d1(VectTy vect) {
    long sum = 0;
    for (auto& v : vect) sum += v;
    return sum;
}
Here you are passing by reference, so there is no copy. OTOH, this shouldn't have compiled (a temporary cannot bind to a non-const lvalue reference), unless you are using MSVC; in that case you should disable language extensions:
long d2(VectTy& vect) {
    long sum = 0;
    for (auto& v : vect) sum += v;
    return sum;
}
Of course there won't be any copy here; you are moving the temporary vector (rvalue) returned by factory into d3:
long d3(VectTy&& vect) {
    long sum = 0;
    for (auto& v : vect) sum += v;
    return sum;
}
If you want to reproduce the copying performance issues, try rolling your own vector class:
#include <vector>

template<class T>
class MyVector
{
private:
    std::vector<T> _vec;
public:
    MyVector() : _vec()
    {}

    MyVector(const MyVector& other) : _vec(other._vec)
    {}

    MyVector& operator=(const MyVector& other)
    {
        if (this != &other)
            this->_vec = other._vec;
        return *this;
    }

    void push_back(T t)
    {
        this->_vec.push_back(t);
    }
};
and use it instead of std::vector; you will for sure hit the performance problem you are looking for.
An rvalue of type vector<T> will have its guts stolen by another vector<T> if you try to construct the second vector<T> from it. If you assign, it may have its guts stolen, or its contents may be moved, or something else (it is underspecified in the standard).
Constructing from an identical rvalue type is called move construction. For a vector, (in most implementations) it consists of reading 3 pointers, writing 3 pointers, and clearing 3 pointers. This is a cheap operation, regardless of how much data the vector holds.
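A toy sketch (not real std::vector code) of what that move construction looks like; the cost is three pointer copies plus nulling out the source, independent of the number of elements:

template <class T>
struct ToyVector {
    T* begin_ = nullptr;
    T* end_   = nullptr;
    T* cap_   = nullptr;

    ToyVector() = default;

    // "Stealing the guts": copy three pointers, leave the source empty.
    ToyVector(ToyVector&& other) noexcept
        : begin_(other.begin_), end_(other.end_), cap_(other.cap_) {
        other.begin_ = other.end_ = other.cap_ = nullptr;
    }

    ~ToyVector() { delete[] begin_; }
};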
Nothing in factory stops NRVO (a kind of elision). Regardless, when you return a local variable (one that, in C++11, exactly matches the return type, or, in C++14, for which a compatible rvalue constructor is found), it is implicitly treated as an rvalue if elision does not occur. So the local vector in factory will either be elided into the return value, or have its guts moved. The difference in cost is trivial, and any difference could then be optimized away anyhow.
Your three functions d1, d2 and d3 would be better called "by-value", "by-lvalue" and "by-rvalue".
The call T1 has the return value of factory elide into the argument of d1. If this elision fails (say you block it), it becomes a move construct, which is trivially more expensive.
The call T2 forces a copy.
The call T3 has no copy, and neither does T4.
Now, under the as-if rule, you can even skip copies if you can prove doing so has no side effect (or, more accurately, if eliminating the copy is a valid variant under the standard for what could happen). GCC may be doing that, which would be a possible explanation for why T2 is no slower.
The problem with benchmarks of pointless tasks is that, under as-if, once the compiler can prove the task was pointless, it can eliminate it.
But I'm not surprised that T1, T3 and T4 are identical, as the standard mandates that they be basically identical in cost, up to a few pointer shuffles.