Why is passing by const ref slower when using std::async - c++

As an exercise to learn about std::async I wrote a small program that calculates the sum of a large vector<int>, distributed across a number of threads.
My code is as follows:
#include <iostream>
#include <vector>
#include <future>
#include <chrono>
typedef unsigned long long int myint;
// Calculate sum of part of the elements in a vector
myint partialSum(const std::vector<myint>& v, int start, int end)
{
    myint sum(0);
    for(int i=start; i<=end; ++i)
    {
        sum += v[i];
    }
    return sum;
}
int main()
{
    const int nThreads = 100;
    const int sizePerThread = 100000;
    const int vectorSize = nThreads * sizePerThread;
    std::vector<myint> v(vectorSize);
    std::vector<std::future<myint>> partial(nThreads);
    myint tot = 0;
    // Fill vector
    for(int i=0; i<vectorSize; ++i)
    {
        v[i] = i+1;
    }
    std::chrono::steady_clock::time_point startTime = std::chrono::steady_clock::now();
    // Start threads
    for( int t=0; t < nThreads; ++t)
    {
        partial[t] = std::async( std::launch::async, partialSum, v, t*sizePerThread, (t+1)*sizePerThread -1);
    }
    // Sum total
    for( int t=0; t < nThreads; ++t)
    {
        myint ps = partial[t].get();
        std::cout << t << ":\t" << ps << std::endl;
        tot += ps;
    }
    std::cout << "Sum:\t" << tot << std::endl;
    std::chrono::steady_clock::time_point endTime = std::chrono::steady_clock::now();
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count() << std::endl;
}
My question concerns the call to the function partialSum, and especially how the large vector is passed. The function is called as follows:
partial[t] = std::async( std::launch::async, partialSum, v, t*sizePerThread, (t+1)*sizePerThread -1);
with the definition as follows
myint partialSum(const std::vector<myint>& v, int start, int end)
With this approach, the calculation is relatively slow. If I use std::ref(v) in the std::async function call, my function is a lot quicker and more efficient. This still makes sense to me.
However, if I still pass v instead of std::ref(v), but change the function signature to
myint partialSum(std::vector<myint> v, int start, int end)
the program also runs a lot quicker (and uses less memory). I don't understand why the const ref implementation is slower. How does the compiler manage this without any references in place?
With the const ref implementation this program typically takes 6.2 seconds to run; without it, 3.0 seconds. (Note that with const ref and std::ref it runs in 0.2 seconds for me.)
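For reference, the std::ref(v) variant mentioned above is just this one-line change to the call (std::cref(v) would arguably express the read-only intent even better, since partialSum takes a const reference):
partial[t] = std::async( std::launch::async, partialSum, std::ref(v), t*sizePerThread, (t+1)*sizePerThread -1);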
I am compiling with g++ -Wall -pedantic (adding -O3 when passing just v demonstrates the same effect), using:
g++ --version
g++ (Rev1, Built by MSYS2 project) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The short story
given a copy and move-constructible type T
V f(T);
V g(T const&);
T t;
auto v = std::async(f,t).get();
auto v = std::async(g,t).get();
the only relevant difference between the two async calls is that, in the first one, t's copy is destroyed as soon as f returns; in the second, t's copy may only be destroyed as an effect of the get() call.
If the async calls happen in a loop with the futures being get() later, the first will use constant memory on average (assuming a constant per-thread workload), while the second may use linearly growing memory in the worst case, resulting in more cache misses and worse allocation performance.
The long story
First of all, I can reproduce the observed slowdown (consistently ~2x on my system) with both gcc and clang; moreover, the same code with equivalent std::thread invocations does not show the same behavior, the const& version turning out slightly faster as expected. Let's see why.
Firstly, the specification of async reads:
[futures.async] If launch::async is set in policy, calls INVOKE(DECAY_COPY(std::forward<F>(f)), DECAY_COPY(std::forward<Args>(args))...) (23.14.3, 33.3.2.2) as if in a new thread of execution represented by a thread object with the calls to DECAY_COPY() being evaluated in the thread that called async [...] The thread object is stored in the shared state and affects the behavior of any asynchronous return objects that reference that state.
So, async will copy the arguments, forwarding those copies to the callable and preserving rvalueness; in this regard it behaves just like the std::thread constructor, and there is no difference between the OP's two versions: both copy the vector.
The difference is in the last sentence of the quote: the thread object is part of the shared state and will not be freed until the latter is released (e.g. by a future::get() call).
Why is this important? Because the standard does not specify what the decayed copies are bound to; we only know that they must outlive the callable's invocation, but we don't know whether they will be destroyed immediately after the call, at thread exit, or when the thread object is destroyed (along with the shared state).
In fact, it turns out that gcc and clang implementations store the decayed copies in the shared state of the resulting future.
Consequently, in the const& version the vector copy is stored in the shared state and destroyed at future::get(): this results in the "Start threads" loop allocating a new vector at each step, with linear growth of memory.
Conversely, in the by-value version the vector copy is moved into the callable's argument and destroyed as soon as the callable returns; at future::get(), only a moved-from, empty vector has to be destroyed. So, if the callable is fast enough to destroy its vector before a new one is created, the same memory will be allocated over and over and memory usage will stay almost constant. This results in more cache hits and faster allocations, explaining the improved timings.

As people said, without std::ref the object is being copied.
Now I believe that the reason that passing by value is actually faster might have something to do with the following question: Is it better in C++ to pass by value or pass by constant reference?
What might happen is that, in the internal implementation of async, the vector is copied once to the new thread and then passed internally by reference to a function that takes ownership of the vector, which means it is copied once again. On the other hand, if you pass it by value, it is copied once to the new thread but then moved twice inside the new thread. That results in 2 copies when the object is passed by reference, and 1 copy plus 2 moves when it is passed by value.
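To probe that claim, here is a rough sketch using a hypothetical Tracer type that counts copy and move constructions; the exact numbers depend on the standard library implementation, so treat the output as illustrative rather than definitive:
#include <atomic>
#include <future>
#include <iostream>

struct Tracer {
    static std::atomic<int> copies, moves;
    Tracer() = default;
    Tracer(const Tracer&) { ++copies; }
    Tracer(Tracer&&) noexcept { ++moves; }
};
std::atomic<int> Tracer::copies{0};
std::atomic<int> Tracer::moves{0};

void byConstRef(const Tracer&) {}
void byValue(Tracer) {}

int main()
{
    Tracer t;
    std::async(std::launch::async, byConstRef, t).get();
    std::cout << "const&: copies=" << Tracer::copies << ", moves=" << Tracer::moves << '\n';
    Tracer::copies = 0; Tracer::moves = 0;
    std::async(std::launch::async, byValue, t).get();
    std::cout << "value:  copies=" << Tracer::copies << ", moves=" << Tracer::moves << '\n';
}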

Related

Does this multithreaded list processing code have enough synchronization?

I have this code in test.cpp:
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <new>
#include <thread>
static const int N_ITEMS = 11;
static const int N_WORKERS = 4;
int main(void)
{
    int* const items = (int*)std::malloc(N_ITEMS * sizeof(*items));
    for (int i = 0; i < N_ITEMS; ++i) {
        items[i] = i;
    }
    std::thread* const workers = (std::thread*)std::malloc(N_WORKERS * sizeof(*workers));
    std::atomic<int> place(0);
    for (int w = 0; w < N_WORKERS; ++w) {
        new (&workers[w]) std::thread([items, &place]() {
            int i;
            while ((i = place.fetch_add(1, std::memory_order_relaxed)) < N_ITEMS) {
                items[i] *= items[i];
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
        });
    }
    for (int w = 0; w < N_WORKERS; ++w) {
        workers[w].join();
        workers[w].~thread();
    }
    std::free(workers);
    for (int i = 0; i < N_ITEMS; ++i) {
        std::cout << items[i] << '\n';
    }
    std::free(items);
}
I compile like so on linux:
c++ -std=c++11 -Wall -Wextra -pedantic test.cpp -pthread
When run, the program should print this:
0
1
4
9
16
25
36
49
64
81
100
I don't have much experience with C++, nor do I really understand atomic operations. According to the standard, will the worker threads always square item values correctly, and will the main thread always print the correct final values? I'm worried that the atomic variable will be updated out-of-sync with the item values, or something. If this is the case, can I change the memory order used with fetch_add to fix the code?
Looks safe to me. Your i = place.fetch_add(1) allocates each array index exactly once, each one to exactly one of the threads. So any given array element is only touched by a single thread, and that's guaranteed to be safe for all types other than bit-fields of a struct (see footnote 1).
Footnote 1: Or elements of std::vector<bool>, which the standard unfortunately requires to be a packed bit-vector, breaking some of the usual guarantees of std::vector.
There's no need for any ordering of these accesses while the worker threads are working; the main thread join()s the workers before reading the array, so everything done by the workers "happens before" (in ISO C++ standardese) the main thread's std::cout << items[i] accesses.
Of course, the array elements are all written by the main thread before the worker threads are started, but that's also safe because the std::thread constructor makes sure everything earlier in the parent thread happens-before anything in the new thread:
The completion of the invocation of the constructor synchronizes-with (as defined in std::memory_order) the beginning of the invocation of the copy of f on the new thread of execution.
There's also no need for any order stronger than mo_relaxed on the increment: it's the only atomic variable in your program, and you don't need any ordering of any of your operations except the overall thread creation and join.
It's still atomic, so it's guaranteed that, say, 100 increments will produce the numbers 0..99, just with no guarantee about which thread gets which. (But there is a guarantee that each thread will see monotonically increasing values: for every atomic object separately, a modification order exists, and that order is consistent with some interleaving of the program-order modifications to it.)
Just for the record, this is hilariously inefficient compared to having each worker pick a contiguous range of indices to square. That would only take 1 atomic access per thread, or the main thread could just pass them positions.
And it would avoid all the false sharing effects of having 4 threads loading and storing into the same cache line at the same time as they move through the array.
Contiguous ranges would also let the compiler auto-vectorize with SIMD to load/square/store multiple elements at once.
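For illustration, a minimal sketch of the contiguous-range approach described above (same N_ITEMS and N_WORKERS as the question; each worker gets a half-open slice handed to it by the main thread, so no atomic counter is needed):
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    const int N_ITEMS = 11, N_WORKERS = 4;
    std::vector<int> items(N_ITEMS);
    for (int i = 0; i < N_ITEMS; ++i) items[i] = i;

    std::vector<std::thread> workers;
    for (int w = 0; w < N_WORKERS; ++w) {
        const int begin = w * N_ITEMS / N_WORKERS;
        const int end = (w + 1) * N_ITEMS / N_WORKERS; // the main thread hands each worker its slice
        workers.emplace_back([&items, begin, end] {
            for (int i = begin; i < end; ++i) items[i] *= items[i];
        });
    }
    for (auto& t : workers) t.join();

    for (int i = 0; i < N_ITEMS; ++i) std::cout << items[i] << '\n';
}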

Why is my parallel foreach loop implementation slower than the single-threaded one?

I am trying to implement a parallel foreach loop for std::vector which runs the computation on the optimal number of threads (number of cores minus 1 for the main thread); however, my implementation does not seem to be fast enough – it actually runs 6 times slower than the single-threaded one!
Thread instantiation is often blamed for being a bottleneck, so I tried a larger vector; however, that did not seem to help.
I am currently stuck watching the parallel algorithm execute in 13000-20000 microseconds in a separate thread while the single-threaded one executes in 120-200 microseconds in the main thread, and I cannot figure out what I am doing wrong. Out of those 13-20 ms of the parallel run, 8 or 9 ms are usually spent creating the threads; even so, I still see no reason why a std::for_each running through 1/3 of the vector in a separate thread should take several times longer than another std::for_each needs to iterate through the whole vector.
#include <iostream>
#include <vector>
#include <thread>
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
const unsigned int numCores = std::thread::hardware_concurrency();
const size_t numUse = numCores - 1;
struct foreach
{
    inline static void go(std::function<void(uint32_t&)>&& func, std::vector<uint32_t>& cont)
    {
        std::vector<std::thread> vec;
        vec.reserve(numUse);
        std::vector<std::vector<uint32_t>::iterator> arr(numUse + 1);
        size_t distance = cont.size() / numUse;
        for (size_t i = 0; i < numUse; i++)
            arr[i] = cont.begin() + i * distance;
        arr[numUse] = cont.end();
        for (size_t i = 0; i < numUse - 1; i++)
        {
            vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
        }
        vec.emplace_back([&] { std::for_each(cont.begin() + (numUse - 1) * distance, cont.end(), func); });
        for (auto &d : vec)
        {
            d.join();
        }
    }
};
int main()
{
    std::chrono::steady_clock clock;
    std::vector<uint32_t> numbers;
    for (size_t i = 0; i < 50000000; i++)
        numbers.push_back(i);
    std::chrono::steady_clock::time_point t0m = clock.now();
    std::for_each(numbers.begin(), numbers.end(), [](uint32_t& value) { ++value; });
    std::chrono::steady_clock::time_point t1m = clock.now();
    std::cout << "Single-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1m - t0m).count() << "mcs\n";
    std::chrono::steady_clock::time_point t0s = clock.now();
    foreach::go([](uint32_t& i) { ++i; }, numbers);
    std::chrono::steady_clock::time_point t1s = clock.now();
    std::cout << "Multi-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1s - t0s).count() << "mcs\n";
    getchar();
}
Is there a way I can optimize this and increase the performance?
The compiler I am using is the Visual Studio 2017 one. The configuration is Release x86. I have also been advised to use a profiler and am currently figuring out how to use one.
I actually managed to get the parallel code to run faster than the regular one, but this required a vector of dozens of thousands of vectors of five elements. If anyone has advice on how to improve the performance, or on where I can find a better implementation to study, that would be appreciated.
Thank you for providing some example code.
Getting good metrics (especially on parallel code) can be pretty tricky. Your metrics are tainted.
Use high_resolution_clock instead of steady_clock for profiling.
Don't include the thread startup time in your timing measurement. Thread launch/join is orders of magnitude longer than your actual work here. You should create the threads once and use condition variables to make them sleep until you signal them to work. This is not trivial, but it is essential that you don't measure the thread startup time.
Visual Studio has a profiler. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.
Another issue kind of related to having good metrics is that your "work" is just incrementing an integer. Is that really representative of the work your program is going to be doing? Increment is really fast. If you look at the assembly generated by your sequential version, everything gets inlined into a really short loop.
Lambdas have a very good chance of being inlined. But in your go function, you're casting the lambda to std::function. std::function has a very poor chance of being inlined.
So if you want to keep the chance of getting the lambda inlined, you have to do some template tricks:
template <typename FUNC>
inline static void go(FUNC&& func, std::vector<uint32_t>& cont)
By manually inlining your code (I moved the contents of the go function to main) and doing step 2 above, I was able to get the parallel version (4 threads on a hyperthreaded dual-core) to run in about 75% of the time. That's not particularly good scaling, but it's not bad considering that the original was already pretty fast. For a further optimization, I would use SIMD aka "vector" (different from std::vector except in the sense that they both relate to arrays) operations which will apply the increment to multiple array elements in one iteration.
You have a race condition here:
for (size_t i = 0; i < numUse - 1; i++)
{
    vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
}
Because you set the default lambda capture to capture-by-reference, the i variable is captured by reference; by the time a thread actually runs, i may already have changed, so some threads can end up processing the wrong range or too long a range. You could write [&, i], but why risk shooting yourself in the foot again? Scott Meyers recommends against using default capture modes. Just do [&cont, &distance, &func, i].
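Restated with that explicit capture list, the loop fragment would look like this (a sketch mirroring the original; i is now captured by value, so each thread operates on the index it was created for):
for (size_t i = 0; i < numUse - 1; i++)
{
    vec.emplace_back([&cont, &distance, &func, i] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
}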
UPDATE:
I think it's a fine idea to move your foreach to its own space. I think what you should do is separate the thread creation from task dispatch. That means you need some kind of signaling system (generally condition variables). You could look into thread pools.
An easy way to add thread pools is to use OpenMP, which Visual Studio 2017 has support for (OpenMP 2.0). A caveat is that there's no guarantee the threads won't be created/destroyed on entry/exit of the parallel section (it's implementation dependent). So it trades off some performance for ease of use.
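As a sketch, the OpenMP version of the original increment loop could look roughly like this (compile with /openmp in Visual Studio or -fopenmp with gcc; without the flag the pragma is simply ignored):
#include <cstdint>
#include <vector>

int main()
{
    std::vector<uint32_t> numbers(50000000, 1);
    // OpenMP 2.0 (what VS 2017 supports) requires a signed loop variable.
    #pragma omp parallel for
    for (int i = 0; i < (int)numbers.size(); ++i)
        ++numbers[i];
}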
If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload). Most of the standard algorithm functions do. https://en.cppreference.com/w/cpp/algorithm/for_each
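A minimal sketch of that overload (C++17; with gcc/libstdc++ the parallel policies additionally require linking against TBB):
#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

int main()
{
    std::vector<uint32_t> numbers(50000000, 1);
    // std::execution::par asks the library to run the loop across multiple threads.
    std::for_each(std::execution::par, numbers.begin(), numbers.end(),
                  [](uint32_t& value) { ++value; });
}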
As for std::function: you can use it, you just don't want your basic operation (the one that will be called 50,000,000 times) to be a std::function.
Bad:
void go(std::function<...>& func)
{
    std::thread t([&v, &func] { std::for_each(v.begin(), v.end(), func); });
    ...
}
...
go([](int& i) { ++i; });
Good:
void go(std::function<...>& func)
{
    std::thread t(func);
    ...
}
...
go([&v]() { std::for_each(v.begin(), v.end(), [](int& i) { ++i; }); });
In the good version, the short inner lambda (i.e. ++i) gets inlined into the call to for_each. That's important because it gets called 50 million times. The call to the bigger lambda is not inlined (because it's converted to std::function), but that's OK because it only gets called once per thread.

Benchmarking adding elements to vector when size is known

I have made a tiny benchmark for adding new elements to a vector whose size I know in advance.
Code:
struct foo{
    foo() = default;
    foo(double x, double y, double z) : x(x), y(y), z(z){
    }
    double x;
    double y;
    double z;
};

void resize_and_index(){
    std::vector<foo> bar(1000);
    for (auto& item : bar){
        item.x = 5;
        item.y = 5;
        item.z = 5;
    }
}

void reserve_and_push(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.push_back(foo(5, 5, 5));
    }
}

void reserve_and_push_move(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.push_back(std::move(foo(5, 5, 5)));
    }
}

void reserve_and_embalce(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.emplace_back(5, 5, 5);
    }
}
I then called each method 100000 times.
results:
resize_and_index: 176 mSec
reserve_and_push: 560 mSec
reserve_and_push_move: 574 mSec
reserve_and_embalce: 143 mSec
Calling code:
const size_t repeate = 100000;
auto start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    resize_and_index();
}
auto stop_time = clock();
std::cout << "resize_and_index: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_push();
}
stop_time = clock();
std::cout << "reserve_and_push: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_push_move();
}
stop_time = clock();
std::cout << "reserve_and_push_move: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_embalce();
}
stop_time = clock();
std::cout << "reserve_and_embalce: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;
My questions:
Why did I get these results? What makes emplace_back superior to the others?
Why does std::move make the performance slightly worse?
Benchmarking conditions:
Compiler: VS.NET 2013 C++ compiler (/O2 Max speed Optimization)
OS : Windows 8
Processor: Intel Core i7-410U CPU @ 2.00 GHz
Another Machine (By horstling):
VS2013, Win7, Xeon 1241 @ 3.5 GHz
resize_and_index: 144 mSec
reserve_and_push: 199 mSec
reserve_and_push_move: 201 mSec
reserve_and_embalce: 111 mSec
First, reserve_and_push and reserve_and_push_move are semantically equivalent. The temporary foo you construct is already an rvalue (the rvalue-reference overload of push_back is used either way); wrapping it in std::move does not change anything, except possibly obscuring the code for the compiler, which could explain the slight performance loss. (Though I think it is more likely to be noise.) Also, your class has identical copy and move semantics.
Second, the resize_and_index variant might be more optimal if you write the loop's body as
item = foo(5, 5, 5);
although only profiling will show that. The point is that the compiler might generate suboptimal code for the three separate assignments.
Third, you should also try this:
std::vector<foo> v(1000, foo(5, 5, 5));
Fourth, this benchmark is extremely sensitive to the compiler realizing that none of these functions actually do anything and simply optimizing their complete bodies out.
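One simple guard against that (a sketch with a simplified foo, not the exact benchmark) is to make each function return a value derived from its data and accumulate it into a volatile sink, so the result is observably used:
#include <vector>

struct foo { double x, y, z; };

volatile double sink; // the volatile store keeps the result observably used

double resize_and_index()
{
    std::vector<foo> bar(1000);
    for (auto& item : bar) { item.x = 5; item.y = 5; item.z = 5; }
    return bar.back().x; // return something derived from the work
}

int main()
{
    for (int i = 0; i < 100000; i++)
        sink = sink + resize_and_index();
}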
Now for analysis. Note that if you really want to know what's going on, you'll have to inspect the assembly the compiler generates.
The first version does the following:
Allocate space for 1000 foos.
Loop and default-construct each one.
Loop over all elements and reassign the values.
The main question here is whether the compiler realizes that the constructor in the second step is a no-op and that it can omit the entire loop. Assembly inspection can show that.
The second and third versions do the following:
Allocate space for 1000 foos.
1000 times:
construct a temporary foo object
ensure there is still enough allocated space
move (for your type, equivalent to a copy, since your class doesn't have special move semantics) the temporary into the allocated space.
Increment the vector's size.
There is a lot of room for optimization here for the compiler. If it inlines all operations into the same function, it could realize that the size check is superfluous. It could then realize that your move constructor cannot throw, which means the entire loop is uninterruptible, which means it could merge all the increments into one assignment. If it doesn't inline the push_back, it has to place the temporary in memory and pass a reference to it; there's a number of ways it could special-case this to be more efficient, but it's unlikely that it will.
But unless the compiler does some of these, I would expect this version to be a lot slower than the others.
The fourth version does the following:
Allocate enough space for 1000 foos.
1000 times:
ensure there is still enough allocated space
create a new object in the allocated space, using the constructor with the three arguments
increment the size
This is similar to the previous, with two differences: first, the way the MS standard library implements push_back, it has to check whether the passed reference is a reference into the vector itself; this greatly increases complexity of the function, inhibiting inlining. emplace_back does not have this problem. Second, emplace_back gets three simple scalar arguments instead of a reference to a stack object; if the function doesn't get inlined, this is significantly more efficient to pass.
Unless you work exclusively with Microsoft's compiler, I would strongly recommend you compare with other compilers (and their standard libraries). I also think that my suggested version would beat all four of yours, but I haven't profiled this.
In the end, unless the code is really performance-sensitive, you should write the version that is most readable. (That's another place where my version wins, IMO.)
Why did I get these results? What makes emplace_back superior to the others?
You got these results because you benchmarked it and you had to get some results :).
emplace_back does better in this case because it constructs the object directly in the memory location reserved by the vector. So it does not have to first create a (possibly temporary) object elsewhere and then copy/move it into the vector's reserved location, thereby saving some overhead.
Why does std::move make the performance slightly worse?
If you are asking why it's more costly than emplace, it's because it has to "move" the object. In this case the move operation effectively reduces to a copy. So it must be the copy operation that takes the extra time, since that copy does not happen in the emplace case.
You can try digging the assembly code generated and see what exactly is happening.
Also, I don't think comparing the rest of the functions against resize_and_index is fair. There is a possibility that objects are being instantiated more than once in the other cases.
I am not sure whether the discrepancy between reserve_and_push and reserve_and_push_move is just noise. I did a simple test using g++ 4.8.4 and noticed an increase in executable size/additional assembly instructions, even though in theory the std::move can be ignored by the compiler in this case.

Memory allocation for return value of a function in a loop in C++11: how does it optimize?

I'm in the mood for some premature optimization and was wondering the following.
If one has a for-loop, and inside that loop there is a call to a function that returns a container, say a vector, whose value is captured as an rvalue into a variable in the loop using move semantics, for instance:
std::vector<any_type> function(int i)
{
    std::vector<any_type> output(3);
    output[0] = i;
    output[1] = i*2;
    output[2] = i-3;
    return(output);
}

int main()
{
    for (int i = 0; i < 10; ++i)
    {
        // stuff
        auto value = function(i);
        // do stuff with value ...
        // ... but in such a way that it can be discarded in the next iteration
    }
}
How do compilers handle this memory-wise in the case that move semantics are applied (and that the function will not be inlined)? I would imagine that the most efficient thing to do is to allocate a single piece of memory for all the values, both inside the function and outside in the for-loop, that will get overwritten in each iteration.
I am mainly interested in this, because in my real-life application the vectors I'm creating are a lot larger than in the example given here. I am concerned that if I use functions like this, the allocation and destruction process will take up a lot of useless time, because I already know that I'm going to use that fixed amount of memory a lot of times. So, what I'm actually asking is whether there's some way that compilers would optimize to something of this form:
void function(int i, std::vector<any_type> &output)
{
    // fill output
}

int main()
{
    std::vector<any_type> dummy; // allocate memory only once
    for (int i = 0; i < 10; ++i)
    {
        // stuff
        function(i, dummy);
        // do stuff with dummy
    }
}
In particular I'm interested in the GCC implementation, but would also like to know what, say, the Intel compiler does.
Here, the most predictable optimization is RVO. When a function returns an object that is used to initialize a new variable, the compiler can elide the additional copy and move and construct it directly in the destination (which means a program can contain two versions of the function, depending on the use case).
Here, you will still pay for allocating and destroying a buffer inside the vector at each loop iteration. If that is unacceptable, you will have to rely on another solution, like std::array (as your function seems to use a fixed size), or moving the vector declaration before the loop and reusing it.
I would imagine that the most efficient thing to do is to allocate a single piece of memory for all the values, both inside the function and outside in the for-loop, that will get overwritten in each iteration.
I don't think that any of the current compilers can do that. (I would be stunned to see that.) If you want to get insights, watch Chandler Carruth's talk.
If you need this kind of optimization, you need to do it yourself: allocate the vector outside the loop and pass it by non-const reference to function() as an argument. Of course, don't forget to call clear() when you are done, or call clear() first inside function().
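A sketch of that reuse pattern, using int in place of any_type; the clear() keeps the allocated capacity while dropping the old contents:
#include <vector>

void function(int i, std::vector<int>& output)
{
    output.clear(); // drop old elements, keep the allocated capacity
    output.push_back(i);
    output.push_back(i * 2);
    output.push_back(i - 3);
}

int main()
{
    std::vector<int> buffer;
    buffer.reserve(3); // allocate memory only once
    for (int i = 0; i < 10; ++i)
    {
        function(i, buffer);
        // do stuff with buffer
    }
}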
All this has nothing to do with move semantics, nothing has changed with C++11 in this respect.
If your loop is a busy loop, then allocating a container in each iteration can cost you a lot. It's easier to find yourself in such a situation than you would probably expect. Andrei Alexandrescu presents an example in his talk Writing Quick Code in C++, Quickly. The surprising thing is that doing unnecessary heap allocations in a tight loop like the one in his example can be slower than the actual file IO. I was surprised to see that. By the way, the container was std::string.

Auto in loop and optimizations

Can you explain why there is such a difference in computation time with the following codes (not optimized)? I suspect RVO vs move-construction, but I'm not really sure.
In general, what is the best practice when encountering such a case? Is an auto declaration in a loop considered bad practice when initializing non-POD data?
Using auto inside the loop:
std::vector<int> foo()
{
    return {1,2,3,4,5};
}

int main()
{
    for (size_t i = 0; i < 1000000; ++i)
        auto f = foo();
    return 0;
}
Output:
./a.out 0.17s user 0.00s system 97% cpu 0.177 total
Vector instance outside the loop:
std::vector<int> foo()
{
    return {1,2,3,4,5};
}

int main()
{
    std::vector<int> f;
    for (size_t i = 0; i < 1000000; ++i)
        f = foo();
    return 0;
}
Output:
./a.out 0.32s user 0.00s system 99% cpu 0.325 total
I suspect RVO vs move-construction but I'm not really sure.
Yes, that is almost certainly what's happening. The first case move-initialises a variable from the function's return value: in this case, the move can be elided by making the function initialise it in place. The second case move-assigns from the return value; assignments can't be elided. I believe GCC performs elision even at optimisation level zero, unless you explicitly disable it.
In the final case (with -O3, which has now been removed from the question) the compiler probably notices that the loop has no side effects, and removes it entirely.
You might (or might not) get a more useful benchmark by declaring the vector volatile and compiling with optimisation. This will force the compiler to actually create/assign it on each iteration, even if it thinks it knows better.
Is auto declaration in a loop considered as a bad practice when initializing non-POD data ?
No; if anything, it's considered better practice to declare things in the narrowest scope that's needed. So if it's only needed in the loop, declare it in the loop. In some circumstances, you may get better performance by declaring a complicated object outside a loop to avoid recreating it on each iteration; but only do that when you're sure that the performance benefit (a) exists and (b) is worth the loss of locality.
I don't see your example having anything to do with auto. You wrote two different programs.
While
for (size_t i = 0; i < 1000000; ++i)
    auto f = foo();
is equivalent to
for (size_t i = 0; i < 1000000; ++i)
    std::vector<int> f = foo();
-- which means you create a new vector (and destroy the old one) on every iteration. And yes, your foo implementation uses RVO, but that is not the point here: you still construct a new vector in the place where the outer loop makes room for f.
The snippet
std::vector<int> f;
for (size_t i = 0; i < 1000000; ++i)
    f = foo();
uses assignment to an existing vector. And yes, with RVO it may become a move-assignment, depending on foo (and it does in your case), so you can expect it to be fast. But it is still a different thing -- it is always the one f that is in charge of managing the resources.
But what you do show very beautifully here is that it often makes sense to follow the general rule
Declare variables as close to their use as possible.
See this discussion.