Benchmarking adding elements to vector when size is known - c++

I have made a tiny benchmark for adding new elements to a vector whose size I know in advance.
Code:
struct foo{
    foo() = default;
    foo(double x, double y, double z) :x(x), y(y), z(z){
    }
    double x;
    double y;
    double z;
};
void resize_and_index(){
    std::vector<foo> bar(1000);
    for (auto& item : bar){
        item.x = 5;
        item.y = 5;
        item.z = 5;
    }
}
void reserve_and_push(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.push_back(foo(5, 5, 5));
    }
}
void reserve_and_push_move(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.push_back(std::move(foo(5, 5, 5)));
    }
}
void reserve_and_embalce(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.emplace_back(5, 5, 5);
    }
}
I then called each method 100000 times.
Results:
resize_and_index: 176 mSec
reserve_and_push: 560 mSec
reserve_and_push_move: 574 mSec
reserve_and_embalce: 143 mSec
Calling code:
const size_t repeat = 100000;

auto start_time = clock();
for (size_t i = 0; i < repeat; i++)
{
    resize_and_index();
}
auto stop_time = clock();
std::cout << "resize_and_index: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

start_time = clock();
for (size_t i = 0; i < repeat; i++)
{
    reserve_and_push();
}
stop_time = clock();
std::cout << "reserve_and_push: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

start_time = clock();
for (size_t i = 0; i < repeat; i++)
{
    reserve_and_push_move();
}
stop_time = clock();
std::cout << "reserve_and_push_move: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

start_time = clock();
for (size_t i = 0; i < repeat; i++)
{
    reserve_and_embalce();
}
stop_time = clock();
std::cout << "reserve_and_embalce: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;
My questions:
Why did I get these results? What makes emplace_back superior to the others?
Why does std::move make the performance slightly worse?
Benchmarking conditions:
Compiler: VS.NET 2013 C++ compiler (/O2 Max speed Optimization)
OS : Windows 8
Processor: Intel Core i7-410U CPU @ 2.00 GHz
Another Machine (By horstling):
VS2013, Win7, Xeon 1241 @ 3.5 GHz
resize_and_index: 144 mSec
reserve_and_push: 199 mSec
reserve_and_push_move: 201 mSec
reserve_and_embalce: 111 mSec

First, reserve_and_push and reserve_and_push_move are semantically equivalent. The temporary foo you construct is already an rvalue (the rvalue reference overload of push_back is already used); wrapping it in a move does not change anything, except possibly obscure the code for the compiler, which could explain the slight performance loss. (Though I think it more likely to be noise.) Also, your class has identical copy and move semantics.
Second, the resize_and_index variant might be faster if you write the loop's body as
item = foo(5, 5, 5);
although only profiling will show that. The point is that the compiler might generate suboptimal code for the three separate assignments.
Third, you should also try this:
std::vector<foo> v(1000, foo(5, 5, 5));
Fourth, this benchmark is extremely sensitive to the compiler realizing that none of these functions actually do anything and simply optimizing their complete bodies out.
Now for analysis. Note that if you really want to know what's going on, you'll have to inspect the assembly the compiler generates.
The first version does the following:
Allocate space for 1000 foos.
Loop and default-construct each one.
Loop over all elements and reassign the values.
The main question here is whether the compiler realizes that the constructor in the second step is a no-op and that it can omit the entire loop. Assembly inspection can show that.
The second and third versions do the following:
Allocate space for 1000 foos.
1000 times:
construct a temporary foo object
ensure there is still enough allocated space
move (for your type, equivalent to a copy, since your class doesn't have special move semantics) the temporary into the allocated space.
Increment the vector's size.
There is a lot of room for optimization here for the compiler. If it inlines all operations into the same function, it could realize that the size check is superfluous. It could then realize that your move constructor cannot throw, which means the entire loop is uninterruptible, which means it could merge all the increments into one assignment. If it doesn't inline the push_back, it has to place the temporary in memory and pass a reference to it; there's a number of ways it could special-case this to be more efficient, but it's unlikely that it will.
But unless the compiler does some of these, I would expect this version to be a lot slower than the others.
The fourth version does the following:
Allocate enough space for 1000 foos.
1000 times:
ensure there is still enough allocated space
create a new object in the allocated space, using the constructor with the three arguments
increment the size
This is similar to the previous, with two differences: first, the way the MS standard library implements push_back, it has to check whether the passed reference is a reference into the vector itself; this greatly increases complexity of the function, inhibiting inlining. emplace_back does not have this problem. Second, emplace_back gets three simple scalar arguments instead of a reference to a stack object; if the function doesn't get inlined, this is significantly more efficient to pass.
Unless you work exclusively with Microsoft's compiler, I would strongly recommend you compare with other compilers (and their standard libraries). I also think that my suggested version would beat all four of yours, but I haven't profiled this.
In the end, unless the code is really performance-sensitive, you should write the version that is most readable. (That's another place where my version wins, IMO.)

Why did I get these results? What makes emplace_back superior to others?
You got these results because you benchmarked it and you had to get some results :).
emplace_back does better in this case because it constructs the object directly in the memory the vector has reserved. It does not have to first create a (possibly temporary) object elsewhere and then copy/move it into the vector's reserved storage, which saves that overhead.
Why does std::move make the performance slightly worse?
If you are asking why it is more costly than emplace, it is because it has to 'move' the object. In this case the move degenerates to a copy (the class has no special move semantics), so it must be the copy operation that takes the extra time, since no such copy happens in the emplace case.
You can try digging the assembly code generated and see what exactly is happening.
Also, I don't think comparing the other functions against resize_and_index is fair; objects may be instantiated more than once in the other cases.

I am not sure that the discrepancy between reserve_and_push and reserve_and_push_move is just noise. I did a simple test using g++ 4.8.4 and noticed an increase in executable size/additional assembly instructions, even though theoretically in this case the std::move can be ignored by the compiler.

Related

Does pass by reference and pass by value cause a change in time complexity of a program in C++?

In Java I never really had to think about whether an argument is passed by reference, because that is not possible. In C++, arguments can be passed either by value or by reference. Does the choice affect the time-complexity analysis in big-O notation, and how would one account for the difference when calculating it? I have found some people saying yes and others saying no; is there a clear answer?
There is no simple yes-or-no answer to this question, since converting one to the other sometimes does not work.
For explaining what happen, I take two simple examples for two directions.
From pass-by-reference to pass-by-value
In this direction the answer is yes. For example:
bool binary_search(std::vector<int> &arr, int key)
{
    return std::binary_search(arr.begin(),arr.end(),key);
}
This is a binary search function in a vector, and the complexity is O(log n) where n is arr.size().
But if we modify it to pass-by-value like:
bool binary_search(std::vector<int> arr, int key)
{
    return std::binary_search(arr.begin(),arr.end(),key);
}
The complexity becomes O(n), since the function must copy arr.
From pass-by-value to pass-by-reference
In this direction the conversion may not work. For example:
std::vector<bool> batch_search(std::vector<int> arr, std::vector<int> keys)
{
    std::sort(arr.begin(), arr.end());
    std::vector<bool> res;
    for(auto key:keys)
    {
        res.push_back(std::binary_search(arr.begin(),arr.end(),key));
    }
    return res;
}
This is a batch search function. It first sorts a copy of the vector (the by-value parameter), then searches for each key. The complexity is O(n log n + m log n), where m is keys.size().
As you can see, the function sorts arr before using it, so directly converting arr from pass-by-value to pass-by-reference will not work.
One way to convert it to pass-by-reference is:
std::vector<bool> batch_search(std::vector<int> &arr, std::vector<int> keys)
{
    std::vector<int> copy_arr(arr);
    std::sort(copy_arr.begin(), copy_arr.end());
    std::vector<bool> res;
    for(auto key:keys)
    {
        res.push_back(std::binary_search(copy_arr.begin(),copy_arr.end(),key));
    }
    return res;
}
Just copy it and sort the copy, since what you need is the values of the vector. In a sense, this implements pass-by-value on top of pass-by-reference.
Another way is:
std::vector<bool> batch_search(std::vector<int> &arr, std::vector<int> keys)
{
    std::vector<bool> res;
    for(auto key:keys)
    {
        res.push_back(std::find(arr.begin(),arr.end(),key)!=arr.end());
    }
    return res;
}
This version changes the algorithm to avoid sorting, and the complexity is O(n*m).
But both approaches do more than just convert the parameter; they rewrite batch_search. So when discussing pass-by-reference and pass-by-value, it seems some functions implemented with pass-by-value cannot be directly converted to pass-by-reference.
Is there a difference in the overall time complexity of an algorithm in C++ when using pass by value versus pass by reference?
This depends on how technical you want to get. The actual time complexity of the algorithm itself does not change: if a given sorting or searching algorithm is expressed as O(n) or O(log n), then that algorithm will always have that property.
As for your code's actual behavior in C++, that can vary with a wide variety of things: your hardware (motherboard and CPU combination, the type of cache you have, how many cores and threads it has, your RAM, etc.), the operating system you are using and what background processes are running, your compiler and the optimizations you enable, whether you are writing multithreaded or parallel code, whether you are utilizing MMX registers, and so on. A lot of factors come into play.
Will your execution time change? Slightly, but almost insignificantly. Here's a demo application of sorting a vector that has a random size between [50k,100k] elements where their values range from [1,10k].
The output will vary each time you run this as a different size vector will be generated, but this is to demonstrate the same algorithm being performed both ways and to show the minimal time difference between the two: I write the values of both vectors before and after sorting to a file just to verify that they are the same before being sorted and that they were actually sorted. I only display the time executions to the console.
#include <algorithm>
#include <exception>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <random>
#include <vector>

template<class Resolution = std::chrono::milliseconds>
class ExecutionTimer {
public:
    using Clock = std::conditional_t<std::chrono::high_resolution_clock::is_steady,
                                     std::chrono::high_resolution_clock,
                                     std::chrono::steady_clock>;
private:
    const Clock::time_point mStart = Clock::now();
public:
    ExecutionTimer() = default;
    ~ExecutionTimer() {
        const auto end = Clock::now();
        std::ostringstream strStream;
        strStream << "Destructor Elapsed: "
                  << std::chrono::duration_cast<Resolution>(end - mStart).count()
                  << std::endl;
        std::cout << strStream.str() << std::endl;
    }
    inline void stop() {
        const auto end = Clock::now();
        std::ostringstream strStream;
        strStream << "Stop Elapsed: "
                  << std::chrono::duration_cast<Resolution>(end - mStart).count()
                  << std::endl;
        std::cout << strStream.str() << std::endl;
    }
};

std::vector<uint32_t> sortVectorByValue(std::vector<uint32_t> values) {
    std::vector<uint32_t> temp{ values };
    std::sort(temp.begin(), temp.end());
    return temp;
}

void sortVectorByReference(std::vector<uint32_t>& values) {
    std::sort(values.begin(), values.end());
}

void writeVectorToFile(std::ofstream& file, std::vector<uint32_t>& values) {
    int count = 0;
    for (auto& v : values) {
        if (count % 15 == 0)
            file << '\n';
        file << std::setw(5) << v << ' ';
        count++;
    }
    file << "\n\n";
}

int main() {
    try {
        std::random_device rd;
        std::mt19937 gen{ rd() };
        std::uniform_int_distribution<uint32_t> distSize(50000, 100000);
        std::uniform_int_distribution<uint32_t> distRange(1, 10000);

        std::vector<uint32_t> values;
        values.resize(distSize(gen));
        for (auto& v : values)
            v = distRange(gen);

        std::ofstream file;
        file.open("random numbers.txt");
        file << "values:\n";
        writeVectorToFile(file, values);

        std::vector<uint32_t> values2{ values };
        file << "values2:\n";
        writeVectorToFile(file, values2);

        // Using a local block scope to cause the Execution Timer to call its destructor
        {
            std::cout << "Evaluated Execution Time for Pass By Value of Sorting " << values.size() << " Elements:\n";
            ExecutionTimer<> timer;
            values = sortVectorByValue(values);
            timer.stop();
        }
        {
            std::cout << "Evaluated Execution Time for Pass By Reference of Sorting " << values2.size() << " Elements:\n";
            ExecutionTimer<> timer;
            sortVectorByReference(values2);
            timer.stop();
        }

        file << "values1:\n";
        writeVectorToFile(file, values);
        file << "values2:\n";
        writeVectorToFile(file, values2);
        file.close();
    } catch (const std::exception& e) {
        std::cerr << e.what() << std::endl;
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
I have an Intel Core 2 Quad Extreme at 3.0 GHz with 8 GB of RAM running Windows 7 64-bit. I'm using Visual Studio 2017 with C++17, and I'm compiling everything in x64 mode.
Here are some of my random outputs in Debug Mode:
Trial 1
Evaluated Execution Time for Pass By Value of Sorting 93347 Elements:
Stop Elapsed: 247
Destructor Elapsed: 247
Evaluated Execution Time for Pass By Reference of Sorting 93347 Elements:
Stop Elapsed: 247
Destructor Elapsed: 247
Trial 2
Evaluated Execution Time for Pass By Value of Sorting 58782 Elements:
Stop Elapsed: 174
Destructor Elapsed: 174
Evaluated Execution Time for Pass By Reference of Sorting 58782 Elements:
Stop Elapsed: 172
Destructor Elapsed: 172
Trial 3
Evaluated Execution Time for Pass By Value of Sorting 67137 Elements:
Stop Elapsed: 194
Destructor Elapsed: 194
Evaluated Execution Time for Pass By Reference of Sorting 67137 Elements:
Stop Elapsed: 191
Destructor Elapsed: 191
Here are some trials in release mode with optimizations set to /O2
Trial 1
Evaluated Execution Time for Pass By Value of Sorting 61078 Elements:
Stop Elapsed: 4
Destructor Elapsed: 5
Evaluated Execution Time for Pass By Reference of Sorting 61078 Elements:
Stop Elapsed: 4
Destructor Elapsed: 4
Trial 2
Evaluated Execution Time for Pass By Value of Sorting 87909 Elements:
Stop Elapsed: 6
Destructor Elapsed: 6
Evaluated Execution Time for Pass By Reference of Sorting 87909 Elements:
Stop Elapsed: 6
Destructor Elapsed: 6
Trial 3
Evaluated Execution Time for Pass By Value of Sorting 93007 Elements:
Stop Elapsed: 7
Destructor Elapsed: 8
Evaluated Execution Time for Pass By Reference of Sorting 93007 Elements:
Stop Elapsed: 9
Destructor Elapsed: 9
All times are measured in milliseconds. It is safe to say that with the std::sort algorithm there is very minimal difference between passing by value and passing by reference. As you can see from the output above, even initializing the temp vector from values and returning the copy in the by-value version has very little overhead compared to its by-reference counterpart.
Yes, there is a little more work to do, but the leading factor in complexity is the term of the polynomial with the highest order. Performing a copy may not always be that expensive, especially with the optimization tricks that modern compilers use, and with how modern CPUs can utilize their caches and vector registers...
I think you should be more concerned with choosing the right algorithm for the right problem and designing appropriate data structures to have proper memory alignment for better cache-hit performance than worrying about the minor semantics of pass by value versus pass by reference. Sometimes you will want to pass by value and others you will want to pass by reference. It's more of a matter of knowing when to use which feature based on the context of the problem at hand.
If a function needs a value from outside but doesn't change that outside value, and later parts of your code don't require it to be updated, then pass by value. If a function needs an outside value, will change it, and you need the updated value after the function call, then pass by reference.
Also when passing containers of large size, passing by reference is normally the preferred choice... the demo that I showed only had up to 100k elements. What if a container had over 3 billion? Would you want to have to copy all 3 billion elements? Probably not. So when you have extremely large containers, it's better to pass by reference if you need to modify the contents of that container. If you only need to reference it to perform calculations for other variables within the scope of that local function, then pass by const reference.
Overall, does it change the time complexity of the algorithm? I'd say no it doesn't! Why?
Because O(n^2 + 10n) is still considered just O(n^2) and O(n + 10000) is still considered just O(n), etc.
Now on the other hand, if the object being copied is complex and requires a bunch of resources, dynamic memory allocations, etc... then yes this can change the complexity of the algorithm, but for anything that is considered RAII that is default constructible, trivially destructible, and even movable, then no it doesn't!

Why is passing by const ref slower when using std::async

As an exercise to learn about std::async I wrote a small program that calculates the sum of a large vector<int>, distributed over a lot of threads.
My code is as follows:
#include <iostream>
#include <vector>
#include <future>
#include <chrono>

typedef unsigned long long int myint;

// Calculate sum of part of the elements in a vector
myint partialSum(const std::vector<myint>& v, int start, int end)
{
    myint sum(0);
    for (int i = start; i <= end; ++i)
    {
        sum += v[i];
    }
    return sum;
}

int main()
{
    const int nThreads = 100;
    const int sizePerThread = 100000;
    const int vectorSize = nThreads * sizePerThread;

    std::vector<myint> v(vectorSize);
    std::vector<std::future<myint>> partial(nThreads);
    myint tot = 0;

    // Fill vector
    for (int i = 0; i < vectorSize; ++i)
    {
        v[i] = i + 1;
    }

    std::chrono::steady_clock::time_point startTime = std::chrono::steady_clock::now();

    // Start threads
    for (int t = 0; t < nThreads; ++t)
    {
        partial[t] = std::async(std::launch::async, partialSum, v, t * sizePerThread, (t + 1) * sizePerThread - 1);
    }

    // Sum total
    for (int t = 0; t < nThreads; ++t)
    {
        myint ps = partial[t].get();
        std::cout << t << ":\t" << ps << std::endl;
        tot += ps;
    }
    std::cout << "Sum:\t" << tot << std::endl;

    std::chrono::steady_clock::time_point endTime = std::chrono::steady_clock::now();
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count() << std::endl;
}
My question is concerned about the calls to the function partialSum, and then especially how the large vector is passed. The function is called as follows:
partial[t] = std::async( std::launch::async, partialSum, v, t*sizePerThread, (t+1)*sizePerThread -1);
with the definition as follows
myint partialSum(const std::vector<myint>& v, int start, int end)
With this approach, the calculation is relatively slow. If I use std::ref(v) in the std::async function call, my function is a lot quicker and more efficient. This still makes sense to me.
However, if I still call by v, instead of std::ref(v), but replace the function with
myint partialSum(std::vector<myint> v, int start, int end)
the program also runs a lot quicker (and uses less memory). I don't understand why the const ref implementation is slower. How does the compiler fix this without any references in place?
With the const ref implementation this program typically takes 6.2 seconds to run, without 3.0. (Note that with const ref, and std::ref it runs in 0.2 seconds for me)
I am compiling with g++ -Wall -pedantic (adding -O3 when passing just v demonstrates the same effect):
g++ --version
g++ (Rev1, Built by MSYS2 project) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The short story
Given a copy- and move-constructible type T:
V f(T);
V g(T const&);
T t;
auto v = std::async(f, t).get();
auto v = std::async(g, t).get();
the only relevant difference concerning the two async calls is that, in the first one, t's copy is destroyed as soon as f returns; in the second, t's copy may only be destroyed as an effect of the get() call.
If the async calls happen in a loop with the futures being get() later, the first will use constant memory on average (assuming a constant per-thread workload), the second a linearly growing memory at worst, resulting in more cache misses and worse allocation performance.
The long story
First of all, I can reproduce the observed slowdown (consistently ~2x in my system) on both gcc and clang; moreover, the same code with equivalent std::thread invocations does not manifest the same behavior, with the const& version turning out slightly faster as expected. Let's see why.
Firstly, the specification of async reads:
[futures.async] If launch::async is set in policy, calls INVOKE(DECAY_COPY(std::forward(f)), DECAY_COPY(std::forward(args))...) (23.14.3, 33.3.2.2) as if in a new thread of execution represented by a thread object with the calls to DECAY_COPY() being evaluated in the thread that called async [...] The thread object is stored in the shared state and affects the behavior of any asynchronous return objects that reference that state.
so, async will copy the arguments forwarding those copies to the callable, preserving rvalueness; in this regard, it's the same as of std::thread constructor, and there's no difference in the two OP versions, both copy the vector.
The difference is in the bold part: the thread object is part of the shared state and will not be freed until the latter gets released (eg. by a future::get() call).
Why is this important? Because the standard does not specify what the lifetime of the decayed copies is bound to; we only know that they must outlive the callable's invocation, but we don't know whether they are destroyed immediately after the call, at thread exit, or when the thread object is destroyed (along with the shared state).
In fact, it turns out that gcc and clang implementations store the decayed copies in the shared state of the resulting future.
Consequently, in the const& version the vector copy is stored in the shared state and destroyed at future::get: this results in the "Start threads" loop allocating a new vector at each step, with a linear growth of memory.
Conversely, in the by-value version the vector copy is moved into the callable's argument and destroyed as soon as the callable returns; at future::get, only a moved-from, empty vector is destroyed. So, if the callable is fast enough to destroy the vector before a new one is created, the same memory will be allocated over and over and memory usage will stay almost constant. This results in fewer cache misses and faster allocations, explaining the improved timings.
As people said, without std::ref the object is being copied.
Now I believe that the reason that passing by value is actually faster might have something to do with the following question: Is it better in C++ to pass by value or pass by constant reference?
What might happen is that in the internal implementation of async, the vector is copied once to the new thread and then internally passed by reference to a function which takes ownership of it, which means it is copied once again. On the other hand, if you pass it by value, it is copied once to the new thread and then moved twice inside the new thread. That results in 2 copies when the object is passed by reference, versus 1 copy and 2 moves when it is passed by value.

Why a program starts slow but later gains the full speed?

I noticed that sometimes a program runs very slowly at first but later the performance is good. For example, I have some code which I run in a loop, and the first iteration takes ages while the other iterations of the same code run pretty fast. It's hard to pin down the circumstances, and it seems that even a single literal can affect this behavior. I prepared a small code snippet:
#include <chrono>
#include <vector>
#include <iostream>

using namespace std;

int main()
{
    const int num{ 100000 };

    vector<vector<int>> octs;
    for (int i{ 0 }; i < num; ++i)
    {
        octs.emplace_back(vector<int>{ 42 });
    }

    vector<int> datas;
    for (int i{ 0 }; i < num; ++i)
    {
        datas.push_back(42);
    }

    for (int n{ 0 }; n < 10; ++n)
    {
        cout << "start" << '\n';
        //cout << 0 << "start" << '\n';
        auto start = chrono::high_resolution_clock::now();
        for (int i{ 0 }; i < num; ++i)
        {
            vector<int> points{ 42 };
        }
        auto end = chrono::high_resolution_clock::now();
        auto time = chrono::duration_cast<chrono::milliseconds>(end - start);
        cout << time.count() << '\n';
    }

    cin.get();
    return 0;
}
The first two vectors are essential, at least with Visual Studio. Though they're not used, they affect the performance a lot. Moreover, tweaking them also has a performance effect (like changing the order of initialization, or removing push_back and allocating the necessary size in the constructor). But this code as it is gives me the following results:
with gcc there are no problems at all
with clang the first iteration takes two times longer than the others
with vs2013 the first iteration is 100 (yes, one hundred) times slower.
Moreover, with vs2013 if I uncomment the line cout << 0 << "start" << '\n'; the performance problem goes away and all iterations are equal!
What's going on?
For your first two loops, probably the biggest performance consideration is going to be the allocation of memory, and the copying of the vector contents to the larger buffer. In this case, the fact that the loops appear to be 'gaining speed' is not surprising.
This is due to the implementation details of the vector class. Let's look at the documentation:
Internally, vectors use a dynamically allocated array to store their elements. This array may need to be reallocated in order to grow in size when new elements are inserted, which implies allocating a new array and moving all elements to it. This is a relatively expensive task in terms of processing time, and thus, vectors do not reallocate each time an element is added to the container.
Instead, vector containers may allocate some extra storage to accommodate for possible growth, and thus the container may have an actual capacity greater than the storage strictly needed to contain its elements (i.e., its size). Libraries can implement different strategies for growth to balance between memory usage and reallocations, but in any case, reallocations should only happen at logarithmically growing intervals of size so that the insertion of individual elements at the end of the vector can be provided with amortized constant time complexity (see push_back).
So under the hood, the actual memory allocated for your vector might be much more than what you are actually using. So the vector only needs to do the costly re-allocation and copy when you add a new element to the vector which wouldn't fit into its current buffer. Moreover, since it says that re-allocations should only happen at logarithmically growing intervals, you can expect that the vector class is roughly doubling the buffer size every time it needs to re-allocate. But note that the vector implementations on various platforms are highly tuned to be optimal for the most common usage patterns for the class, which could be one factor in the different performance you are seeing across tool chains and platforms.
So you should see the loops be slow on the first several executions, and then gain more speed as push_back and emplace operations need to do fewer re-allocations and copies to accommodate the new elements.
So I think this is the main fact you can use to reason about how long your first two loops should take to execute. But for your specific examples, due to the simplicity of the program, the compiler may be taking some liberties with what code it generates. So we could imagine that a sufficiently clever optimizing compiler might be able to see that your vectors will only be growing to a size which it knows at compile time, num. And this is the biggest issue I suspect with your last loop, which seems like an arbitrary and useless test. For example, the nested loop in loop 3 can be optimized away entirely. I think this is the main reason why you are seeing such different run-time behavior across the different compilers.
If you want to get the real story, take a look at the assembly code that your compiler is generating.

clearing a vector or defining a new vector, which one is faster

Which method is faster and has less overhead?
Method 1:
void foo() {
    std::vector< int > aVector;
    for ( int i = 0; i < 1000000; ++i ) {
        aVector.clear();
        aVector.push_back( i );
    }
}
Method 2:
void foo() {
    for ( int i = 0; i < 1000000; ++i ) {
        std::vector< int > aVector;
        aVector.push_back( i );
    }
}
You may say that the example is meaningless! But this is just a snippet from my big code. In short I want to know is it better to
"create a vector once and clear it for usage"
or
"create a new vector every time"
UPDATE
Thanks for the suggestions, I tested both and here are the results
Method 1:
$ time ./test1
real 0m0.044s
user 0m0.042s
sys 0m0.002s
Method 2:
$ time ./test2
real 0m0.601s
user 0m0.599s
sys 0m0.002s
Clearing the vector is better. Maybe this help someone else :)
The clear() is most likely to be faster, as you will retain the memory that has been allocated for previous push_back()s into the vector, thus decreasing the need for allocation.
Also you do away with 1 constructor call and 1 destructor call per loop.
This is all ignoring what your compiler's optimizer might do with this code.
To create an empty vector is very little overhead. To GROW a vector to a large size is potentially quite expensive, as it doubles in size each time - so a 1M entry vector would have 15-20 "copies" made of the current content.
For trivial basic types, such as int, the overhead of creating an object and destroying the object is "nothing", but for any more complex object, you will have to take into account the construction and destruction of the object, which is often substantially more than the "put the object in the vector" and "remove it from the vector". In other words, the constructor and destructor for each object is what matters.
For EVERY "which is faster of X or Y" you really need to benchmark for the circumstances that you want to understand, unless it's VERY obvious that one is clearly faster than the other (such as "linear search or binary search of X elements", where linear search is proportional to X, and binary search is log2(x)).
Further, I'm slightly confused by your example: storing ONE element in a vector is quite cumbersome, and a fair bit of overhead compared to int x = i; so I presume you don't really mean that as a benchmark. In other words, your particular comparison is not very fair, because constructing 1M vectors is clearly more work than constructing ONE vector and filling it and clearing it 1M times. However, if you made your test something like this:
void foo() {
    for ( int i = 0; i < 1000; ++i ) {
        std::vector< int > aVector;
        for ( int j = 0; j < 1000; ++j )
        {
            aVector.push_back( i );
        }
    }
}
[and the corresponding change to the other code], I expect the results would be fairly similar.

Auto in loop and optimizations

Can you explain me why there is such difference in computation time with the following codes (not optimized). I suspect RVO vs move-construction but I'm not really sure.
In general, what is the best practice when encountering such case ? Is auto declaration in a loop considered as a bad practice when initializing non-POD data ?
Using auto inside the loop :
std::vector<int> foo()
{
    return {1,2,3,4,5};
}

int main()
{
    for (size_t i = 0; i < 1000000; ++i)
        auto f = foo();
    return 0;
}
Output :
./a.out 0.17s user 0.00s system 97% cpu 0.177 total
Vector instance outside the loop :
std::vector<int> foo()
{
    return {1,2,3,4,5};
}

int main()
{
    std::vector<int> f;
    for (size_t i = 0; i < 1000000; ++i)
        f = foo();
    return 0;
}
Output :
./a.out 0.32s user 0.00s system 99% cpu 0.325 total
I suspect RVO vs move-construction but I'm not really sure.
Yes, that is almost certainly what's happening. The first case move-initialises a variable from the function's return value: in this case, the move can be elided by making the function initialise it in place. The second case move-assigns from the return value; assignments can't be elided. I believe GCC performs elision even at optimisation level zero, unless you explicitly disable it.
In the final case (with -O3, which has now been removed from the question) the compiler probably notices that the loop has no side effects, and removes it entirely.
You might (or might not) get a more useful benchmark by declaring the vector volatile and compiling with optimisation. This will force the compiler to actually create/assign it on each iteration, even if it thinks it knows better.
Is auto declaration in a loop considered as a bad practice when initializing non-POD data ?
No; if anything, it's considered better practice to declare things in the narrowest scope that's needed. So if it's only needed in the loop, declare it in the loop. In some circumstances, you may get better performance by declaring a complicated object outside a loop to avoid recreating it on each iteration; but only do that when you're sure that the performance benefit (a) exists and (b) is worth the loss of locality.
I don't see your example having anything to do with auto. You wrote two different programs.
While
for (size_t i = 0; i < 1000000; ++i)
auto f = foo();
is equivalent to
for (size_t i = 0; i < 1000000; ++i)
std::vector<int> f = foo();
-- which means you create a new vector on every iteration (and destroy the previous one). And yes, your foo implementation benefits from RVO, but that is not the point here: you still construct a fresh vector in the storage the loop provides for f each time around.
The snippet
std::vector<int> f;
for (size_t i = 0; i < 1000000; ++i)
f = foo();
uses assignment to an existing vector. And yes, with RVO it may become a move-assign, depending on foo, and it does in your case, so you can expect it to be fast. But it is still a different thing: there is always the one f that is in charge of managing the resources.
But what you do show very beautifully here is that it often makes sense to follow the general rule
Declare variables as close to their use as possible.