I'm trying to get a better grip on when I should implement move semantics in my code and when I can leave it to the compiler. While doing some basic profiling/benchmarking, I got confused as to why moving a vector actually takes longer than copying it:
#include <chrono>
#include <vector>
#include <utility>
#include <iostream>
#define REPS 10000000
int main(int argc, char const *argv[]) {
// ----------------------------------------------------
std::chrono::high_resolution_clock clock;
auto start = clock.now();
for (int i = 0; i < REPS; i++) {
std::vector< float >(700, 1.54343);
}
auto end = clock.now();
std::cout << "Instantiating vector [x" << std::to_string(REPS) << "]: " << std::chrono::duration_cast< std::chrono::milliseconds >(end - start).count() << " ms" << std::endl;
// ----------------------------------------------------
std::vector< float > src(700, 1.54343);
start = clock.now();
for (int i = 0; i < REPS; i++) {
std::vector< float > dst = src;
}
end = clock.now();
std::cout << "Copying from rvalue [x" << std::to_string(REPS) << "]: " << std::chrono::duration_cast< std::chrono::milliseconds >(end - start).count() << " ms" << std::endl;
// ----------------------------------------------------
start = clock.now();
for (int i = 0; i < REPS; i++) {
std::vector< float > dst = std::vector< float >(700, 1.54343); // vector< float >(700, 1.54343);
}
end = clock.now();
std::cout << "Moving from a temporary [x" << std::to_string(REPS) << "]: " << std::chrono::duration_cast< std::chrono::milliseconds >(end - start).count() << " ms" << std::endl;
}
I get the following output:
Instantiating vector [x10000000]: 1160 ms
Copying from rvalue [x10000000]: 1167 ms
Moving from a temporary [x10000000]: 1491 ms
But shouldn't moving it, at worst, be as efficient as copying it (or in this case, initializing a new one)? Is there anything wrong with how I'm profiling this code?
Weirdly enough, I also noticed that commenting out the first loop had a significant impact on the timings:
Instantiating vector [x10000000]: 0 ms
Copying from rvalue [x10000000]: 1176 ms
Moving from a temporary [x10000000]: 1049 ms
Which may have an obvious explanation, but seems quite strange to me.
You're not measuring a move! In your snippet, the line where you want to move consists of one construction of a new 700-element vector and one initialization of dst from that temporary, which together take much longer than a single move.
If you want to measure the move operation itself, you can write something like this:
std::vector< float > dst = std::move(src);
My test result without this modification:
Instantiating vector [x10000000]: 22000 ms
Copying from rvalue [x10000000]: 19912 ms
Moving from a temporary [x10000000]: 22694 ms
My test result with this modification:
Instantiating vector [x10000000]: 23187 ms
Copying from rvalue [x10000000]: 20267 ms
Moving from a temporary [x10000000]: 593 ms
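To make it concrete what std::move does for a vector, here is a minimal sketch, assuming the default allocator. The helper name move_transfers_buffer is hypothetical and only serves the illustration: it checks that the destination takes over the source's heap buffer and that the source is left empty.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: moves `src` into a new vector and reports whether the
// heap buffer was handed over instead of being copied.
bool move_transfers_buffer(std::vector<float>& src) {
    const float* before = src.data();         // buffer currently owned by src
    std::vector<float> dst = std::move(src);  // move constructor: constant-time handover
    return dst.data() == before && src.empty();
}
```

Because src is empty after the first move, a loop that repeatedly executes dst = std::move(src) moves an empty vector on every later iteration, so the 593 ms figure above probably measures mostly that rather than moving 700 floats each time.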
Related
I'm trying to parallelize some old code using the execution policies from C++17. My sample code is below:
#include <cstdlib>
#include <chrono>
#include <iostream>
#include <algorithm>
#include <execution>
#include <vector>
using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::duration<double>;
constexpr auto NUM = 100'000'000U;
double func()
{
return rand();
}
int main()
{
std::vector<double> v(NUM);
// ------ feature testing
std::cout << "__cpp_lib_execution : " << __cpp_lib_execution << std::endl;
std::cout << "__cpp_lib_parallel_algorithm: " << __cpp_lib_parallel_algorithm << std::endl;
// ------ fill the vector with random numbers sequentially
auto const startTime1 = Clock::now();
std::generate(std::execution::seq, v.begin(), v.end(), func);
Duration const elapsed1 = Clock::now() - startTime1;
std::cout << "std::execution::seq: " << elapsed1.count() << " sec." << std::endl;
// ------ fill the vector with random numbers in parallel
auto const startTime2 = Clock::now();
std::generate(std::execution::par, v.begin(), v.end(), func);
Duration const elapsed2 = Clock::now() - startTime2;
std::cout << "std::execution::par: " << elapsed2.count() << " sec." << std::endl;
}
The program output on my Linux desktop:
__cpp_lib_execution : 201902
__cpp_lib_parallel_algorithm: 201603
std::execution::seq: 0.971162 sec.
std::execution::par: 25.0349 sec.
Why does the parallel version perform 25 times worse than the sequential one?
Compiler: g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0
The thread-safety of rand is implementation-defined, which means either:
Your code is wrong in the parallel case, or
It's effectively serial, with a highly contended lock, which would dramatically increase the overhead in the parallel case and give incredibly poor performance.
Based on your results, I'm guessing #2 applies, but it could be both.
Either way, the answer is: rand is a terrible test case for parallelism.
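One way to take shared state out of the picture is to give every thread its own generator. The following is only a sketch under that assumption; thread_local_random is a hypothetical replacement for func(), and it produces values in a different range and sequence than rand().

```cpp
#include <random>

// Hypothetical replacement for func(): each thread lazily creates its own
// engine and distribution, so concurrent calls never touch shared state.
double thread_local_random() {
    thread_local std::mt19937 engine(std::random_device{}());
    thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(engine);
}
```

It could be passed to the parallel call in place of func, for example std::generate(std::execution::par, v.begin(), v.end(), thread_local_random). Whether this actually beats the sequential version still depends on the implementation and on how cheap each call is compared to the scheduling overhead.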
I want to measure the time it takes to call a function.
Here is the code:
for (int i = 0; i < 5; ++i) {
std::cout << "Pass : " << i << "\n";
const auto t0 = std::chrono::high_resolution_clock::now();
system1.euler_intregration(0.0166667);
const auto t1 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() << "\n";
}
But the compiler keeps optimising the loop away, so no time is measured and the output is zero.
I have tried using asm("") and __asm__("") as advised here but nothing works for me.
I must admit that I don't really know how these asm() statements work, so I might be using them in the wrong way.
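For reference, a commonly used form of the asm trick is sketched below. It relies on the GCC/Clang extended asm extension, so it is not portable, and it assumes the timed call writes its results into an object whose address can be passed in. The names escape and time_call_ns are hypothetical.

```cpp
#include <chrono>

// GCC/Clang extension: an empty asm statement that claims to read `p` and to
// clobber memory, so stores reachable through `p` cannot simply be discarded.
inline void escape(const void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Hypothetical helper: times a single call of work(state) in nanoseconds.
template <class State, class Work>
long long time_call_ns(State& state, Work work) {
    const auto t0 = std::chrono::steady_clock::now();
    work(state);
    escape(&state);  // pretend the updated state is observed here
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

With the code above, the call might look like time_call_ns(system1, [](auto& s) { s.euler_intregration(0.0166667); }), assuming euler_intregration updates the state stored in system1. An empty asm("") without operands or a memory clobber gives the compiler no reason to keep the computation, which may be why it had no effect.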
In the following code, I have two pairs of nested for loops. The second pair swaps the order of the loops and runs significantly faster.
Is this purely a cache locality issue (the first code loops over a vector many times, whereas the second code loops over the vector once), or is there something else that I'm not understanding?
#include <chrono>
#include <iostream>
#include <vector>
int main()
{
using namespace std::chrono;
auto n = 1 << 12;
std::vector<int> v(n);
high_resolution_clock::time_point t1 = high_resolution_clock::now();
for(int i = 0; i < (1 << 16); ++i)
{
for(const auto val : v) i & val;
}
high_resolution_clock::time_point t2 = high_resolution_clock::now();
duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
std::cout << "It took me " << time_span.count() << " seconds.";
std::cout << std::endl;
t1 = high_resolution_clock::now();
for(const auto val : v)
{
for(int i = 0; i < (1 << 16); ++i) i & val;
}
t2 = high_resolution_clock::now();
time_span = duration_cast<duration<double>>(t2 - t1);
std::cout << "It took me " << time_span.count() << " seconds.";
std::cout << std::endl;
}
As written, the second loop needs to read each val from vector v only once. The first version needs to read each val from vector v once in the inner loop for every i, so in total 65536 times.
So without any optimisation, this will make the second loop several times faster. With optimisation turned on high enough, the compiler will figure out that all these calculations achieve nothing, and are unnecessary, and throw them all away. Your execution times will then go down to zero.
If you change the code to do something with the results (like adding up all values i & val, then printing the total), a really good compiler may figure out that both pieces of code produce the same result and use the faster method for both cases.
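A minimal sketch of "doing something with the results": both loop orders accumulate i & val into a total that is returned, so the work can no longer be thrown away as dead code. The function names are hypothetical.

```cpp
#include <vector>

// Outer loop over i: the whole vector is traversed once per value of i.
long long sum_i_outer(const std::vector<int>& v, int limit) {
    long long total = 0;
    for (int i = 0; i < limit; ++i)
        for (const auto val : v) total += i & val;
    return total;
}

// Outer loop over the vector: each element is read once, then reused for every i.
long long sum_val_outer(const std::vector<int>& v, int limit) {
    long long total = 0;
    for (const auto val : v)
        for (int i = 0; i < limit; ++i) total += i & val;
    return total;
}
```

Both functions return the same total, so printing it after each timed section keeps the benchmark honest. A sufficiently clever optimizer may still transform one loop order into the other.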
Given an Eigen fixed size type, say an Eigen::Vector3d, why is the type not PoD? The underlying data is an array of 3 doubles, and there should not be a need for a nontrivial constructor or destructor.
Template-wise, quite a bit (depending on the version) goes on in the constructor, albeit at compile time. While all of this is evaluated at compile time and therefore optimized out, an empty constructor still remains. If you add an empty constructor to a POD type, it too will not be memcpy'ed when using std::copy. Try this:
#include <algorithm>
#include <chrono>
#include <Eigen/Core>
#include <vector>
#include <iostream>
struct notpod
{
notpod() {}
double d[3];
};
struct pod
{
double d[3];
};
using Eigen::Vector3d;
int main(int argc, char** argv)
{
std::chrono::time_point<std::chrono::high_resolution_clock > start, end;
int sz = 20000000;
{
std::vector<pod> a(sz), b(sz);
start = std::chrono::high_resolution_clock ::now();
std::copy(a.begin(), a.end(), b.begin());
end = std::chrono::high_resolution_clock ::now();
std::cout << " POD vector copy took " << (std::chrono::duration<double>(end - start)).count() << " seconds.\n";
}
{
std::vector<notpod> na(sz), nb(sz);
start = std::chrono::high_resolution_clock ::now();
std::copy(na.begin(), na.end(), nb.begin());
end = std::chrono::high_resolution_clock ::now();
std::cout << " NotPOD vector copy took " << (std::chrono::duration<double>(end - start)).count() << " seconds.\n";
}
{
std::vector<Vector3d> a3(sz), b3(sz);
start = std::chrono::high_resolution_clock ::now();
std::copy(a3.begin(), a3.end(), b3.begin());
end = std::chrono::high_resolution_clock ::now();
std::cout << "Vector3d vector copy took " << (std::chrono::duration<double>(end - start)).count() << " seconds.\n";
}
return 0;
}
which outputs on my machine:
POD vector copy took 0.135008 seconds.
NotPOD vector copy took 0.35202 seconds.
Vector3d vector copy took 0.35302 seconds.
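The effect of the empty constructor on the type's classification can be checked at compile time with the standard type traits. This sketch only restates the two structs from above. Whether std::copy falls back to memmove depends on the standard library implementation and on which trait it consults, so treat this as an illustration rather than an explanation of the exact timings.

```cpp
#include <type_traits>

struct pod    { double d[3]; };
struct notpod { notpod() {} double d[3]; };

// Without a user-provided constructor the struct is trivial.
static_assert(std::is_trivial<pod>::value, "pod is trivial");
// The user-provided (even empty) default constructor makes the type non-trivial...
static_assert(!std::is_trivial<notpod>::value, "notpod is not trivial");
// ...although it remains trivially copyable.
static_assert(std::is_trivially_copyable<notpod>::value, "notpod is trivially copyable");
```

The same traits can be queried for Eigen::Vector3d to see which category it falls into for a given Eigen version.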
I created a little performance test comparing the setup and access times of three popular techniques for dynamic allocation: raw pointer, std::unique_ptr, and a std::deque.
EDIT: per @NathanOliver's comment, added std::vector.
EDIT 2: per latedeveloper's comment, allocated with the std::vector(n) and std::deque(n) constructors.
EDIT 3: per @BaummitAugen's comment, moved allocation inside the timing loop and compiled an optimized version.
EDIT 4: per @PaulMcKenzie's comments, set runs to 2000.
Results: These changes have tightened things up a lot. Deque and Vector are still slower on allocation and assignment, while deque is much slower on access:
pickledEgg$ g++ -std=c++11 -o sp2 -O2 sp2.cpp
Average of 2000 runs:
Method Assign Access
====== ====== ======
Raw: 0.0000085643 0.0000000724
Smart: 0.0000085281 0.0000000732
Deque: 0.0000205775 0.0000076908
Vector: 0.0000163492 0.0000000760
Just for fun, here are -Ofast results:
pickledEgg$ g++ -std=c++11 -o sp2 -Ofast sp2.cpp
Average of 2000 runs:
Method Assign Access
====== ====== ======
Raw: 0.0000045316 0.0000000893
Smart: 0.0000038308 0.0000000730
Deque: 0.0000165620 0.0000076475
Vector: 0.0000063442 0.0000000699
ORIGINAL: For posterity; note lack of optimizer -O2 flag:
pickledEgg$ g++ -std=c++11 -o sp2 sp2.cpp
Average of 100 runs:
Method Assign Access
====== ====== ======
Raw: 0.0000466522 0.0000468586
Smart: 0.0004391623 0.0004406758
Deque: 0.0003144142 0.0021758729
Vector: 0.0004715145 0.0003829193
Updated Code:
#include <iostream>
#include <iomanip>
#include <vector>
#include <deque>
#include <chrono>
#include <memory>
const int NUM_RUNS(2000);
int main() {
std::chrono::high_resolution_clock::time_point b, e;
std::chrono::duration<double> t, raw_assign(0), raw_access(0), smart_assign(0), smart_access(0), deque_assign(0), deque_access(0), vector_assign(0), vector_access(0);
int k, tmp, n(32768);
std::cout << "Average of " << NUM_RUNS << " runs:" << std::endl;
std::cout << "Method " << '\t' << "Assign" << "\t\t" << "Access" << std::endl;
std::cout << "====== " << '\t' << "======" << "\t\t" << "======" << std::endl;
// Raw
for (k=0; k<NUM_RUNS; ++k) {
b = std::chrono::high_resolution_clock::now();
int* raw_p = new int[n]; // run-time allocation
for (int i=0; i<n; ++i) { //assign
raw_p[i] = i;
}
e = std::chrono::high_resolution_clock::now();
t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
raw_assign+=t;
b = std::chrono::high_resolution_clock::now();
for (int i=0; i<n; ++i) { //access
tmp = raw_p[i];
}
e = std::chrono::high_resolution_clock::now();
t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
raw_access+=t;
delete [] raw_p; // :^)
}
raw_assign /= NUM_RUNS;
raw_access /= NUM_RUNS;
std::cout << "Raw: " << '\t' << std::setprecision(10) << std::fixed << raw_assign.count() << '\t' << raw_access.count() << std::endl;
// Smart
for (k=0; k<NUM_RUNS; ++k) {
b = std::chrono::high_resolution_clock::now();
std::unique_ptr<int []> smart_p(new int[n]); // run-time allocation
for (int i=0; i<n; ++i) { //assign
smart_p[i] = i;
}
e = std::chrono::high_resolution_clock::now();
t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
smart_assign+=t;
b = std::chrono::high_resolution_clock::now();
for (int i=0; i<n; ++i) { //access
tmp = smart_p[i];
}
e = std::chrono::high_resolution_clock::now();
t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
smart_access+=t;
}
smart_assign /= NUM_RUNS;
smart_access /= NUM_RUNS;
std::cout << "Smart: " << '\t' << std::setprecision(10) << std::fixed << smart_assign.count() << '\t' << smart_access.count() << std::endl;
// Deque
for (k=0; k<NUM_RUNS; ++k) {
b = std::chrono::high_resolution_clock::now();
std::deque<int> myDeque(n);
for (int i=0; i<n; ++i) { //assign
myDeque[i] = i; // index with i; myDeque[n] is one past the end
// myDeque.push_back(i);
}
e = std::chrono::high_resolution_clock::now();
t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
deque_assign+=t;
b = std::chrono::high_resolution_clock::now();
for (int i=0; i<n; ++i) { //access
tmp = myDeque[i]; // index with i; myDeque[n] is one past the end
}
e = std::chrono::high_resolution_clock::now();
t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
deque_access+=t;
}
deque_assign /= NUM_RUNS;
deque_access /= NUM_RUNS;
std::cout << "Deque: " << '\t' << std::setprecision(10) << std::fixed << deque_assign.count() << '\t' << deque_access.count() << std::endl;
// vector
for (k=0; k<NUM_RUNS; ++k) {
b = std::chrono::high_resolution_clock::now();
std::vector<int> myVector(n);
for (int i=0; i<n; ++i) { //assign
myVector[i] = i;
// .push_back(i);
}
e = std::chrono::high_resolution_clock::now();
t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
vector_assign+=t;
b = std::chrono::high_resolution_clock::now();
for (int i=0; i<n; ++i) { //access
tmp = myVector[i];
// tmp = *(myVector.begin() + i);
}
e = std::chrono::high_resolution_clock::now();
t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
vector_access+=t;
}
vector_assign /= NUM_RUNS;
vector_access /= NUM_RUNS;
std::cout << "Vector:" << '\t' << std::setprecision(10) << std::fixed << vector_assign.count() << '\t' << vector_access.count() << std::endl;
std::cout << std::endl;
return 0;
}
As you can see from the results, raw pointers are the clear winner in both categories. Why is this?
Because ...
g++ -std=c++11 -o sp2 sp2.cpp
... you didn't enable optimization. Calling an operator overloaded for a non-fundamental type such as std::vector or std::unique_ptr involves a function call. Using operators of fundamental types, such as a raw pointer, does not involve function calls.
A function call is typically slower than no function call. Over many iterations, the small overhead of the function call adds up. However, an optimizer can expand function calls inline, thereby removing the disadvantage of non-fundamental types, but only if optimization is enabled.
std::deque has an additional reason for being slower: the algorithm to access an arbitrary element of a double-ended queue is more complicated than indexing into an array. While std::deque has decent random access performance, it is not as good as an array's. A more appropriate use case for std::deque is linear iteration (using an iterator).
Furthermore, you used std::deque::at, which does bounds checking. The subscript operator does not do bounds checking. Bounds checking adds runtime overhead.
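The difference between the two accessors can be seen in a small sketch: at() checks the index and throws std::out_of_range, whereas operator[] performs no check (an out-of-range index there is undefined behavior). The helper name at_rejects is hypothetical.

```cpp
#include <cstddef>
#include <deque>
#include <stdexcept>

// Hypothetical helper: returns true if d.at(index) reports an out-of-range index.
bool at_rejects(const std::deque<int>& d, std::size_t index) {
    try {
        (void)d.at(index);  // bounds-checked access
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

For a deque of four elements, index 3 is accepted and index 4 is rejected. That comparison on every access is the extra cost being referred to.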
The slight edge that the raw array appears to have over std::vector in allocation speed may be because std::vector zero-initializes the data.
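The initialization difference can be sketched as follows. std::vector<int>(n) value-initializes its elements to zero, new int[n]() does the same for a raw array, and plain new int[n] leaves the values indeterminate, so they must be written before they are read. All function names here are hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Value-initialized: every element is 0.
std::vector<int> make_vector(std::size_t n) { return std::vector<int>(n); }

// Note the trailing (): the array is zeroed, which costs a pass over the memory.
std::unique_ptr<int[]> make_zeroed(std::size_t n) {
    return std::unique_ptr<int[]>(new int[n]());
}

// No initializer: nothing is written, so the allocation itself is cheaper.
std::unique_ptr<int[]> make_uninitialized(std::size_t n) {
    return std::unique_ptr<int[]>(new int[n]);
}

// Counts the non-zero elements in [p, p + n).
std::size_t count_nonzero(const int* p, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) count += (p[i] != 0);
    return count;
}
```

In the benchmark every element is overwritten immediately afterwards, so the zeroing done by the vector is extra work that the raw new int[n] version skips.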
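A std::deque is not a contiguous array. It is typically implemented as a sequence of fixed-size blocks reached through an index of block pointers, so myDeque.at(i) needs an extra indirection plus a bounds check on every call. Random access is still constant time, but that per-element overhead is why access to the deque is so slow.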
The initialization of std::vector is slow because you don't preallocate enough memory. std::vector then starts with a small capacity and usually doubles it as soon as you try to insert more elements. This reallocation involves calling the move constructor for all elements. Try to construct the vector like this:
std::vector<int> myVector(n);
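Note that for std::vector<int>, parentheses and braces select different constructors, which is easy to get wrong when preallocating. A small sketch with hypothetical helper names:

```cpp
#include <cstddef>
#include <vector>

// Parentheses: the size constructor, giving n value-initialized elements.
std::size_t size_with_parens(int n) {
    std::vector<int> v(n);
    return v.size();
}

// Braces: the initializer-list constructor, giving a single element with value n.
std::size_t size_with_braces(int n) {
    std::vector<int> v{n};
    return v.size();
}
```

For n = 32768, the first returns 32768 and the second returns 1. Only the parenthesized form preallocates the storage.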
In the vector access, I wonder why you didn't use tmp = myVector[i]. Instead of calling the index operator, you instantiate an iterator, call its + operator, and then call the dereference operator on the result. Since you are not optimizing, function calls will probably not be inlined, which is why std::vector access is slower than the raw pointer.
For std::unique_ptr, I suppose the reasons are similar to those for std::vector. You always call the index operator on the unique pointer, which is a function call as well. As an experiment, immediately after allocating the memory for smart_p, call smart_p.get() and use the raw pointer for the rest of the operations. I assume it will be just as fast as the raw pointer, which would support my assumption that the function calls are the cause. The simple advice, then, is to enable optimizations and try again.
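The suggested experiment could look roughly like the sketch below. The function names are hypothetical, and the timing code is left out. Returning the sum gives each loop an observable result, so an optimizer cannot drop it.

```cpp
#include <memory>

// Access through unique_ptr::operator[] (a function call unless it is inlined).
long long sum_via_subscript(const std::unique_ptr<int[]>& p, int n) {
    long long total = 0;
    for (int i = 0; i < n; ++i) total += p[i];
    return total;
}

// Fetch the raw pointer once with get() and index it directly.
long long sum_via_raw(const std::unique_ptr<int[]>& p, int n) {
    const int* raw = p.get();
    long long total = 0;
    for (int i = 0; i < n; ++i) total += raw[i];
    return total;
}
```

Both functions compute the same value. Without optimization the first pays for a call per element, while with -O2 the two would be expected to compile to very similar loops.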
kmiklas edit:
Average of 2000 runs:
Method Assign Access
====== ====== ======
Raw: 0.0000086415 0.0000000681
Smart: 0.0000081824 0.0000000670
Deque: 0.0000204542 0.0000076554
Vector: 0.0000164252 0.0000000678