I can't understand the logic here. I have two containers with the same element count: a vector and an unordered_map.
I'm using a function that measures the execution time of some other function and returns the value in milliseconds:
auto funcComplexity = [](auto&& func, auto&&... params)
{
const auto& start = high_resolution_clock::now();
for (auto i = 0; i < 100; ++i)
{
// function invocation using perfect forwarding
std::forward<decltype(func)>(func)(std::forward<decltype(params)>(params)...);
}
const auto& stop = high_resolution_clock::now();
return duration_cast<duration<double, std::milli>>(stop - start).count();
};
When I erase an element from the middle of the vector, it actually takes less time than erasing an element from the unordered_map:
void iterateUMap(std::unordered_map<int, Test> map)
{
map.erase(500);
}
void iterateVector(std::vector<Test> vector)
{
std::remove_if(vector.begin(), vector.end(), [](Test& val)
{
return val.mId == 500;
});
}
int main()
{
auto funcComplexity = [](auto&& func, auto&&... params)
{
const auto& start = high_resolution_clock::now();
for (auto i = 0; i < 10000; ++i)
{
// function invocation using perfect forwarding
std::forward<decltype(func)>(func)(std::forward<decltype(params)>(params)...);
}
const auto& stop = high_resolution_clock::now();
return duration_cast<duration<double, std::milli>>(stop - start).count();
};
std::unordered_map<int, Test> uMap;
for(int i = 0; i < 1000; i++)
{
uMap.try_emplace(i, Test(i));
}
std::vector<Test> vector;
vector.reserve(1000);
for (int i = 0; i < 1000; i++)
{
vector.emplace_back(i);
}
cout << funcComplexity(iterateUMap, uMap) << endl;
cout << endl;
cout << funcComplexity(iterateVector, vector) << endl;
return 0;
}
So the output of those two functions:
For vector is: 52.6565 milliseconds
For U_Map is : 6740.64 milliseconds
Why is erasing from the unordered_map actually slower than erasing from the vector? The same thing happens when I just want to get an element. For example:
void iterateUMap(std::unordered_map<int, Test> map)
{
map[500].DoSomething();
}
void iterateVector(std::vector<Test> vector)
{
for (auto& value : vector)
{
if(value.mId == 500)
{
value.DoSomething();
break;
}
}
}
You've just discovered one of the secrets about modern computer hardware.
Modern CPUs are very, very good at accessing memory linearly - which is how the data in a vector happens to be stored. Accessing X linear memory addresses is several orders of magnitude faster than accessing X random memory addresses, and X does not have to be very big before this is observed.
This is because, in general, memory gets retrieved from RAM in comparatively large blocks and then stored in fast, on-CPU cache. Once a large memory block is cached, accessing the next byte or word hits the fast on-CPU cache and does not result in a comparatively slow RAM access. And even before the next byte or word gets accessed, while the CPU is busy digesting the first couple of bytes, a different part of the CPU will proactively fetch the next cache line, anticipating future memory accesses.
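If you want to see the effect yourself, here is a minimal sketch (hypothetical names; exact timings will vary by machine) that sums the same data once in linear order and once through a shuffled index sequence, i.e. a random access pattern:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
int main()
{
    const std::size_t n = 1 << 22;              // ~4M ints
    std::vector<int> data(n, 1);
    // Two index sequences over the same data: one linear, one shuffled.
    std::vector<std::size_t> linear(n);
    std::iota(linear.begin(), linear.end(), 0);
    std::vector<std::size_t> shuffled = linear;
    std::shuffle(shuffled.begin(), shuffled.end(), std::mt19937{42});
    auto time_sum = [&](const std::vector<std::size_t>& idx) {
        const auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        for (std::size_t i : idx) sum += data[i];
        const auto stop = std::chrono::steady_clock::now();
        std::cout << "sum=" << sum << " took "
                  << std::chrono::duration<double, std::milli>(stop - start).count()
                  << " ms\n";
    };
    time_sum(linear);    // cache- and prefetch-friendly
    time_sum(shuffled);  // mostly cache misses; typically several times slower
}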
Related
I am trying to measure the performance of concurrent insertion in folly's hash map. A simplified version of a program doing such insertion is shown here:
#include <folly/concurrency/ConcurrentHashMap.h>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
const int kNumMutexLocks = 2003;
std::unique_ptr<std::mutex[]> mutices(new std::mutex[kNumMutexLocks]);
__inline__ void
concurrentInsertion(unsigned int threadId, unsigned int numInsertionsPerThread,
unsigned int numInsertions, unsigned int numUniqueKeys,
folly::ConcurrentHashMap<int, int> &follyMap) {
int base = threadId * numInsertionsPerThread;
for (int i = 0; i < numInsertionsPerThread; i++) {
int idx = base + i;
if (idx >= numInsertions)
break;
int val = idx;
int key = val % numUniqueKeys;
mutices[key % kNumMutexLocks].lock();
auto found = follyMap.find(key);
if (found != follyMap.end()) {
int oldVal = found->second;
if (oldVal < val) {
follyMap.assign(key, val);
}
} else {
follyMap.insert(key, val);
}
mutices[key % kNumMutexLocks].unlock();
}
}
void func(unsigned int numInsertions, float keyValRatio) {
const unsigned int numThreads = 12; // Simplified just for this post
unsigned int numUniqueKeys = numInsertions * keyValRatio;
unsigned int numInsertionsPerThread = ceil(numInsertions * 1.0 / numThreads);
std::vector<std::thread> insertionThreads;
insertionThreads.reserve(numThreads);
folly::ConcurrentHashMap<int, int> follyMap;
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < numThreads; i++) {
insertionThreads.emplace_back(std::thread([&, i] {
concurrentInsertion(i, numInsertionsPerThread, numInsertions,
numUniqueKeys, follyMap);
}));
}
for (int i = 0; i < numThreads; i++) {
insertionThreads[i].join();
}
auto end = std::chrono::steady_clock::now();
auto diff = end - start;
float insertionTimeMs =
std::chrono::duration<double, std::milli>(diff).count();
std::cout << "i: " << numInsertions << "\tj: " << keyValRatio
<< "\ttime: " << insertionTimeMs << std::endl;
}
int main() {
std::vector<float> js = {0.5, 0.25};
for (auto j : js) {
std::cout << "-------------" << std::endl;
for (int i = 2048; i < 4194304 * 8; i *= 2) {
func(i, j);
}
}
}
The problem is that using this loop in main suddenly increases the measured time in the func function. That is, if I call the function directly from main without any loop (as shown below), the measured time for some cases is suddenly more than 100X smaller.
int main() {
func(2048, 0.25); // ~ 100X faster now that the loop is gone.
}
Possible Reasons
I allocate a huge amount of memory while building the hashmap. I believe that when I run the code in a loop, the computer is busy freeing the memory of the first iteration while the second iteration is being executed. Hence, the program becomes much slower. If this is the case, I'd be grateful if someone could suggest a change so that I can get the same results with the loop.
More Details
Please note that if I unroll the loop in main, I have the same issue. That is, the following program has the same problem:
int main() {
performComputation(input A);
...
performComputation(input Z);
}
Sample Output
The output of the first program is shown here:
i: 2048 j: 0.5 time: 1.39932
...
i: 16777216 j: 0.5 time: 3704.33
-------------
i: 2048 j: 0.25 time: 277.427 <= sudden increase in execution time
i: 4096 j: 0.25 time: 157.236
i: 8192 j: 0.25 time: 50.7963
i: 16384 j: 0.25 time: 133.151
i: 32768 j: 0.25 time: 8.75953
...
i: 2048 j: 0.25 time: 162.663
Running the func alone in main with i=2048 and j=0.25 yields:
i: 2048 j: 0.25 time: 1.01
Any comment/insight is highly appreciated.
If it is the memory allocation that is slowing it down, and the contents of the memory before performComputation(input) are irrelevant, you could just re-use the allocated memory block.
int performComputation(input, std::vector<char>& memory) {
/* Note: memory will need to be passed by reference*/
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < numThreads; i++) {
t.emplace_back(std::thread([&, i] {
func(...); // Random access to memory
}));
}
for (int i = 0; i < numThreads; i++) {
t[i].join();
}
auto end = std::chrono::steady_clock::now();
float time = std::chrono::duration<double, std::milli>(end - start).count();
}
int main() {
// A. Allocate ~1GB memory here
std::vector<char> memory(1024 * 1024 * 1024); // 1 GiB
for (input: inputs)
performComputation(input, memory);
}
I can't be too confident on the exact details, but it seems to me to be a result of memory allocation in building the map. I replicated the behaviour you're seeing using a plain unordered_map and a single mutex, and making the map object in func static fixed it entirely. (Actually now it's slightly slower the first time around, since no memory has been allocated for the map yet, and then faster and a consistent time every subsequent run.)
I'm not sure why this makes a difference, since the map has been destructed and the memory should have been freed. For some reason it seems the map's freed memory isn't reused on subsequent calls to func. Perhaps someone else more knowledgeable than I can elaborate on this.
Edit: a reduced minimal reproducible example and its output
#include <chrono>
#include <iostream>
#include <unordered_map>
void func(int num_insertions)
{
const auto start = std::chrono::steady_clock::now();
std::unordered_map<int, int> map;
for (int i = 0; i < num_insertions; ++i)
{
map.emplace(i, i);
}
const auto end = std::chrono::steady_clock::now();
const auto diff = end - start;
const auto time = std::chrono::duration<double, std::milli>(diff).count();
std::cout << "i: " << num_insertions << "\ttime: " << time << "\n";
}
int main()
{
func(2048);
func(16777216);
func(2048);
}
With non-static map:
i: 2048 time: 0.6035
i: 16777216 time: 4629.03
i: 2048 time: 124.44
With static map:
i: 2048 time: 0.6524
i: 16777216 time: 4828.6
i: 2048 time: 0.3802
Another edit: I should also mention that the static version also requires a call to map.clear() at the end, though that's not really relevant to the question of insertion performance.
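For completeness, here is a sketch of the static variant described above (an illustration of the idea, not necessarily the exact code used); the map.clear() at the end is the call the note above refers to:
void func(int num_insertions)
{
    const auto start = std::chrono::steady_clock::now();
    static std::unordered_map<int, int> map;   // static: the same map object is reused across calls
    for (int i = 0; i < num_insertions; ++i)
    {
        map.emplace(i, i);
    }
    const auto end = std::chrono::steady_clock::now();
    const auto time = std::chrono::duration<double, std::milli>(end - start).count();
    std::cout << "i: " << num_insertions << "\ttime: " << time << "\n";
    map.clear();   // empty it for the next call (outside the timed region)
}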
When measuring wall-clock time, use averages!
You are measuring wall-clock time. The actual time jumps seen are fairly small in this regard and could in theory be caused by OS delays, other processes running, or perhaps thread management (e.g. cleanup) done by your program (note that this can vary a lot depending on platform/system, and remember that a context switch can easily take ~10-15 ms). There are just too many parameters in play to be sure.
When using wall-clock time to measure, it is common practice to average over a loop of some hundreds or thousands of runs to take spikes etc. into account.
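As a rough sketch of that practice (a hypothetical helper, not from the original post), you can wrap the code under test in a loop and report the mean per run:
#include <chrono>
// Runs `fn` `repetitions` times and returns the average wall-clock time per run, in milliseconds.
template <class Fn>
double average_ms(Fn&& fn, int repetitions = 1000)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
    {
        fn();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count() / repetitions;
}
// Usage (work() is a stand-in for whatever you are measuring):
//   double ms = average_ms([] { work(); });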
Use a profiler
Learn to use a profiler - a profiler can help you quickly see what your program is actually spending its time on, and it will save you precious time again and again.
While optimizing performance-critical code, I noticed that iterating over a std::set was a bit slow.
I then wrote a benchmark and tested the speeds of iterating over a vector by iterator (auto it : vector), iterating over a set by iterator, and iterating over a vector by index (int i = 0; i < vector.size(); ++i).
The containers are constructed identically, with 1024 random ints. (Of course, each int is unique, since we're working with sets.) Then, for each run, we loop through the container and sum its ints into a long int. Each run does the sum 1000 times, and the test was averaged over 1000 runs.
Here are my results:
Testing vector by iterator
✓
Maximum duration: 0.012418
Minimum duration: 0.007971
Average duration: 0.008354
Testing vector by index
✓
Maximum duration: 0.002881
Minimum duration: 0.002094
Average duration: 0.002179
Testing set by iterator
✓
Maximum duration: 0.021862
Minimum duration: 0.014278
Average duration: 0.014971
As we can see, iterating over a set by iterator is 1.79x slower than iterating over a vector by iterator, and a whopping 6.87x slower than iterating over a vector by index.
What is going on here? Isn't a set just a structured vector that checks whether each item is unique upon insertion? Why should it be so much slower?
Edit: Thank you for your replies! Good explanations. By request, here is the code of the benchmark.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <random>
#include <set>
#include <string>
#include <vector>
void benchmark(const char* name, int runs, int iterations, std::function<void(int)> func) {
printf("Testing %s\n", name);
std::chrono::duration<double> min = std::chrono::duration<double>::max();
std::chrono::duration<double> max = std::chrono::duration<double>::min();
std::chrono::duration<double> run = std::chrono::duration<double>::zero();
std::chrono::duration<double> avg = std::chrono::duration<double>::zero();
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
// [removed] progress bar code
for (int i = 0; i < runs; ++i) {
t1 = std::chrono::high_resolution_clock::now();
func(iterations);
t2 = std::chrono::high_resolution_clock::now();
run = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
// [removed] progress bar code
if (run < min) min = run;
if (run > max) max = run;
avg += run / 1000.0;
}
// [removed] progress bar code
printf("Maximum duration: %f\n", max.count());
printf("Minimum duration: %f\n", min.count());
printf("Average duration: %f\n", avg.count());
printf("\n");
}
int main(int argc, char const *argv[]) {
const unsigned int arrSize = 1024;
std::vector<int> vector; vector.reserve(arrSize);
std::set<int> set;
for (int i = 0; i < arrSize; ++i) {
while (1) {
int entry = rand() - (RAND_MAX / 2);
auto ret = set.insert(entry);
if (ret.second) {
vector.push_back(entry);
break;
}
}
}
printf("Created vector of size %lu, set of size %lu\n", vector.size(), set.size());
benchmark("vector by iterator", 1000, 1000, [vector](int runs) -> void {
for (int i = 0; i < runs; ++i) {
long int sum = 0;
for (auto it : vector) {
sum += it;
}
}
});
benchmark("vector by index", 1000, 1000, [vector, arrSize](int runs) -> void {
for (int i = 0; i < runs; ++i) {
long int sum = 0;
for (int j = 0; j < arrSize; ++j) {
sum += vector[j];
}
}
});
benchmark("set by iterator", 1000, 1000, [set](int runs) -> void {
for (int i = 0; i < runs; ++i) {
long int sum = 0;
for (auto it : set) {
sum += it;
}
}
});
return 0;
}
I'm working on posting the results with -O2, but I'm trying to get the compiler to avoid optimizing away the sum.
Isn't a set just a structured vector that checks whether each item is unique upon insertion?
No, by far not. These data structures are completely different, and the main distinction here is the memory layout: std::vector puts its elements into a contiguous region of memory, while std::set is a node-based container, where every element is separately allocated and resides at a distinct place in memory, possibly far away from the others and definitely in a way that makes pre-fetching data for fast traversal impossible for the processor. std::vector is quite the opposite - as the next element is always right "next to" the current one in memory, the CPU will load elements into its cache ahead of time, and when actually processing the elements, it only has to go to the cache to retrieve the values - which is very fast compared to a RAM access.
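A quick way to see this layout difference for yourself (a small illustrative sketch, not from the original answer) is to print the addresses of the elements during traversal:
#include <iostream>
#include <set>
#include <vector>
int main()
{
    std::vector<int> v{1, 2, 3, 4};
    std::set<int> s{1, 2, 3, 4};
    std::cout << "vector elements:\n";
    for (const int& x : v)
        std::cout << &x << "\n";   // consecutive addresses, sizeof(int) apart
    std::cout << "set elements:\n";
    for (const int& x : s)
        std::cout << &x << "\n";   // each lives inside a separately allocated tree node
}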
Note that it's a common need to have a sorted, unique collection of data that is laid out contiguously in memory, and C++2a or the version thereafter might actually ship with a flat_set - have a look at P1222.
Matt Austern's "Why you shouldn't use set (and what you should use instead)" is an interesting read, too.
The main reason is that when you iterate over a std::vector, which stores its elements in a contiguous memory chunk, you basically do:
++p;
where p is a raw T* pointer. The actual STL code is:
__normal_iterator&
operator++() _GLIBCXX_NOEXCEPT
{
++_M_current; // <--- std::vector<>: ++iter
return *this;
}
For a std::set, the underlying object is more complex and in most implementations you iterate over a tree-like structure. In its simplest form this is something like:
p=p->next_node;
where p is a pointer to a tree node structure:
struct tree_node {
...
tree_node *next_node;
};
but in practice the "real" STL code is much more complex:
_Self&
operator++() _GLIBCXX_NOEXCEPT
{
_M_node = _Rb_tree_increment(_M_node); // <--- std::set<> ++iter
return *this;
}
// ----- underlying code \/\/\/
static _Rb_tree_node_base*
local_Rb_tree_increment(_Rb_tree_node_base* __x) throw ()
{
if (__x->_M_right != 0)
{
__x = __x->_M_right;
while (__x->_M_left != 0)
__x = __x->_M_left;
}
else
{
_Rb_tree_node_base* __y = __x->_M_parent;
while (__x == __y->_M_right)
{
__x = __y;
__y = __y->_M_parent;
}
if (__x->_M_right != __y)
__x = __y;
}
return __x;
}
_Rb_tree_node_base*
_Rb_tree_increment(_Rb_tree_node_base* __x) throw ()
{
return local_Rb_tree_increment(__x);
}
const _Rb_tree_node_base*
_Rb_tree_increment(const _Rb_tree_node_base* __x) throw ()
{
return local_Rb_tree_increment(const_cast<_Rb_tree_node_base*>(__x));
}
(see: What is the definition of _Rb_tree_increment in bits/stl_tree.h?)
First of all, you should note that a std::set is sorted. This is typically achieved by storing the data in a tree-like structure.
A vector is typically stored in a contiguous memory area (like a simple array), which can therefore be cached effectively. And this is why it is faster.
std::vector is a contiguous structure. The elements are all laid out in memory sequentially, so iterating it only requires an addition and a single pointer lookup per element. Additionally it's very cache-friendly since retrieving an element will generally cause a whole chunk of the vector to be loaded into cache.
std::set is a node-based structure; generally a red-black tree. Iterating over it is more involved and requires chasing several pointers per element. It's also not very cache-friendly since the elements aren't necessarily near each other in memory.
Simplified question with a working example: I want to reuse a std::unordered_map (let's call it umap) multiple times, similar to the following dummy code (which does not do anything meaningful). How can I make this code run faster?
#include <iostream>
#include <unordered_map>
#include <time.h>
unsigned size = 1000000;
void foo(){
std::unordered_map<int, double> umap;
umap.reserve(size);
for (int i = 0; i < size; i++) {
// in my real program: umap gets filled with meaningful data here
umap.emplace(i, i * 0.1);
}
// ... some code here which does something meaningful with umap
}
int main() {
clock_t t = clock();
for(int i = 0; i < 50; i++){
foo();
}
t = clock() - t;
printf ("%f s\n",((float)t)/CLOCKS_PER_SEC);
return 0;
}
In my original code, I want to store matrix entries in umap. In each call to foo, the key values start from 0 up to N, and N can be different in each call to foo, but there is an upper limit of 10M for indices. Also, values can be different (contrary to the dummy code here which is always i*0.1).
I tried making umap a non-local variable to avoid the repeated memory allocation of umap.reserve() in each call. This requires calling umap.clear() at the end of foo, but that turned out to be actually slower than using a local variable (I measured it).
I don't think there is any good way to accomplish what you're looking for directly - i.e. you can't clear the map without clearing the map. I suppose you could allocate a number of maps up front and use each one of them a single time as a "disposable map", then go on to use the next map during your next call, but I doubt this would give you any overall speedup, since at the end of it all you'd have to clear all of them at once, and in any case it would be very RAM-intensive and cache-unfriendly (in modern CPUs, RAM access is very often the performance bottleneck, and therefore minimizing the number of cache misses is the way to maximize efficiency).
My suggestion would be that if clear-speed is so critical, you may need to move away from using unordered_map entirely and instead use something simpler like a std::vector - in that case you can simply keep a number-of-valid-items-in-the-vector integer, and "clearing" the vector is just a matter of setting that count back to zero. (Of course, that means you sacrifice unordered_map's quick-lookup properties, but perhaps you don't need them at this stage of your computation?)
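A minimal sketch of that idea (hypothetical names; it assumes the keys are the indices 0..N-1, as described in the question):
#include <vector>
struct ReusableValues {
    std::vector<double> values;   // indexed directly by key
    std::size_t valid_count = 0;  // how many entries are currently meaningful
    void resize_once(std::size_t max_size) { values.resize(max_size); }
    void set(std::size_t key, double v) {
        values[key] = v;
        if (key + 1 > valid_count) valid_count = key + 1;
    }
    // "Clearing" is just resetting the count; no deallocation, no per-element work.
    void clear() { valid_count = 0; }
};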
A simple and effective way is to reuse the same container and memory again and again by passing it by reference, as follows.
With this method you avoid the repeated memory allocation and deallocation done by std::unordered_map::reserve and std::unordered_map::~unordered_map, which both have complexity O(number of elements):
void foo(std::unordered_map<int, double>& umap)
{
std::size_t N = ...// set N here
for (int i = 0; i < N; ++i)
{
// overwrite umap[0], ..., umap[N-1]
// If umap does not have key=i, then it is inserted.
umap[i] = i*0.1;
}
// do something, and do not access umap[N], ..., umap[size-1]!
}
The caller side would be as follows:
std::unordered_map<int,double> umap;
umap.reserve(size);
for(int i=0; i<50; ++i){
foo(umap);
}
But since your keys are always the consecutive integers 0, 1, ..., N-1, I think that a std::vector, which lets you avoid hash calculations, would be preferable for storing the values vec[0], ..., vec[N-1]:
void foo(std::vector<double>& vec)
{
int N = ...// set N here
for(int i = 0; i<N; ++i)
{
// overwrite vec[0], ..., vec[N-1]
vec[i] = i*0.1;
}
// do something, and do not access vec[N], ..., vec[size-1]!
}
Have you tried to avoid all memory allocation by using a simple array? You've said above that you know the maximum size of umap over all calls to foo():
#include <iostream>
#include <unordered_map>
#include <time.h>
constexpr int size = 1000000;
double af[size];
void foo(int N) {
// assert(N<=size);
for (int i = 0; i < N; i++) {
af[i] = i;
}
// ... af
}
int main() {
clock_t t = clock();
for(int i = 0; i < 50; i++){
foo(size /* or some other N<=size */);
}
t = clock() - t;
printf ("%f s\n",((float)t)/CLOCKS_PER_SEC);
return 0;
}
As I suggested in the comments, closed hashing would be better for your use case. Here's a quick&dirty closed hash map with a fixed hashtable size you could experiment with:
#include <array>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>
template<class Key, class T, std::size_t size = 1000003, class Hash = std::hash<Key>>
class closed_hash_map {
typedef std::pair<const Key, T> value_type;
typedef typename std::vector<value_type>::iterator iterator;
std::array<int, size> hashtable{}; // value-initialized so every slot starts out as 0 (empty)
std::vector<value_type> data;
public:
iterator begin() { return data.begin(); }
iterator end() { return data.end(); }
iterator find(const Key &k) {
size_t h = Hash()(k) % size;
while (hashtable[h]) {
if (data[hashtable[h]-1].first == k)
return data.begin() + (hashtable[h] - 1);
if (++h == size) h = 0; }
return data.end(); }
std::pair<iterator, bool> insert(const value_type& obj) {
size_t h = Hash()(obj.first) % size;
while (hashtable[h]) {
if (data[hashtable[h]-1].first == obj.first)
return std::make_pair(data.begin() + (hashtable[h] - 1), false);
if (++h == size) h = 0; }
data.emplace_back(obj);
hashtable[h] = data.size();
return std::make_pair(data.end() - 1, true); }
void clear() {
data.clear();
hashtable.fill(0); }
};
It can be made more flexible by dynamically resizing the hashtable on demand when appropriate, and more efficient by using Robin Hood replacement.
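A possible usage sketch (illustrative only; with the default table size the object holds a multi-megabyte std::array, so std::make_unique, which requires C++14, is used here to keep it off the stack):
#include <iostream>
#include <memory>
int main()
{
    auto map = std::make_unique<closed_hash_map<int, double>>();
    map->insert({42, 3.14});
    map->insert({42, 2.71});            // a second insert with the same key is rejected
    auto it = map->find(42);
    if (it != map->end())
        std::cout << it->first << " -> " << it->second << "\n";   // prints 42 -> 3.14
    map->clear();                        // reuse the same map (and its memory) for the next round
}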
I am making a test program to measure the storage time for each container. The following is my code for the test.
#include <list>
#include <vector>
#include <iostream>
#include <iomanip>
#include <string>
#include <ctime>
#include <cstdlib>
using namespace std;
void insert(list<short>& l, const short& value);
void insert(vector<short>& v, const short& value);
void insert(short arr[], int& logicalSize, const int& physicalSize, const short& value);
int main() {
clock_t start, end;
srand(time(nullptr));
const int SIZE = 50000;
const short RANGE = 10000;
list<short> l;
vector<short> v;
short* arr = new short[SIZE];
int logicalSize = 0;
// array
start = clock();
cout << "Array storage time test...";
for (int i = 0; i < SIZE; i++) {
try {
insert(arr, logicalSize, SIZE, (short)(rand() % (2 * RANGE + 1) - RANGE));
} catch (string s) {
cout << s << endl;
system("pause");
exit(-1);
}
}
end = clock();
cout << "Time: " << difftime(end, start) << endl << endl;
// list
cout << "List storage time test...";
start = clock();
for (int i = 0; i < SIZE; i++) {
insert(l, (short)(rand() % (2 * RANGE + 1) - RANGE));
}
end = clock();
cout << "Time: " << difftime(end, start) << endl << endl;
// vector
cout << "Vector storage time test...";
start = clock();
for (int i = 0; i < SIZE; i++) {
insert(v, (short)(rand() % (2 * RANGE + 1) - RANGE));
}
end = clock();
cout << "Time: " << difftime(end, start) << endl << endl;
delete[] arr;
system("pause");
return 0;
}
void insert(list<short>& l, const short& value) {
for (auto it = l.begin(); it != l.end(); it++) {
if (value < *it) {
l.insert(it, value);
return;
}
}
l.push_back(value);
}
void insert(vector<short>& v, const short& value) {
for (auto it = v.begin(); it != v.end(); it++) {
if (value < *it) {
v.insert(it, value);
return;
}
}
v.push_back(value);
}
void insert(short arr[], int& logicalSize, const int& physicalSize, const short& value) {
if (logicalSize == physicalSize) throw string("No spaces in array.");
for (int i = 0; i < logicalSize; i++) {
if (value < arr[i]) {
for (int j = logicalSize - 1; j >= i; j--) {
arr[j + 1] = arr[j];
}
arr[i] = value;
logicalSize++;
return;
}
}
arr[logicalSize] = value;
logicalSize++;
}
However, when I execute the code, the results seem a little different from the theory. The list should be fastest, but the results say that insertion into the list is slowest. Can you tell me why?
Inserting into a vector or array requires moving everything after the insertion point; so inserting at a random spot requires, on average, 1.5 accesses to each element: 0.5 to find the spot, and 0.5*2 (one read plus one write) to do the insert.
Inserting into a list requires 0.5 accesses per element (to find the spot).
This means the vector only does 3 times as many element accesses.
List nodes are 5 to 9 times larger than vector "nodes" (which are just the elements). Forward iteration requires reading 3 to 5 times as much memory (a 16-bit element plus a 32- to 64-bit pointer).
So the list solution reads/writes more memory! Worse, it is sparser (with the back pointer), and it may not be arranged in a cache-friendly way in memory (vectors are contiguous; list nodes may be scattered all over the address space), which interferes with the CPU's cache prediction and prefetching.
A list is very rarely faster than a vector; you have to be inserting/deleting many times more often than you iterate over the list.
Finally, a vector uses exponential growth with reserved unused space, while a list allocates on every insertion. Calling new is slow, and often not much slower when you ask for bigger chunks than when you ask for smaller ones. Growing a vector by 1 at a time, 1000 times, results in about 15 allocations (give or take); for a list, it's 1000 allocations.
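A small sketch (illustrative; the exact count depends on the implementation's growth factor) that counts how many reallocations 1000 push_backs actually trigger:
#include <iostream>
#include <vector>
int main()
{
    std::vector<short> v;
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (int i = 0; i < 1000; ++i)
    {
        v.push_back(static_cast<short>(i));
        if (v.capacity() != last_capacity)   // capacity changed => a reallocation happened
        {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    std::cout << "reallocations for 1000 push_backs: " << reallocations << "\n";
}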
Insertion into a list is blisteringly fast, but first you have to find where you want to insert. This is where the list comes out a loser.
It might be helpful to stop and read Why is it faster to process a sorted array than an unsorted array? sometime around now because it covers similar material and covers it really well.
With a vector or array each element comes one after the next. Prediction is dead easy, so the CPU can be loading the cache with values you won't need for a while at the same time as it is processing the current value.
With a list predictability is shot, you have to get the next node before you can load the node after that, and that pretty much nullifies the cache. Without the cache you can see an order of magnitude degradation in performance as the CPU sits around waiting for data to be retrieved from RAM.
Bjarne Stroustrup has a number of longer pieces on this topic. The keynote video is definitely worth watching.
One important take-away is to take Big-O notation with a grain of salt, because it measures the efficiency of the algorithm, not how well the algorithm takes advantage of the hardware.
I am trying to write a piece of code to convince myself that pass by value and pass by reference (rvalue and lvalue reference) have a significant impact on performance (related question). I came up with the code below and thought the performance differences should be visible.
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib> // for std::rand
#define DurationTy std::chrono::duration_cast<std::chrono::milliseconds>
typedef std::vector<int> VectTy;
size_t const MAX = 10000u;
size_t const NUM = MAX / 10;
int randomize(int mod) { return std::rand() % mod; }
VectTy factory(size_t size, bool pos) {
VectTy vect;
if (pos) {
for (size_t i = 0u; i < size; i++) {
// vect.push_back(randomize(size));
vect.push_back(i);
}
} else {
for (size_t i = 0u; i < size * 2; i++) {
vect.push_back(i);
// vect.push_back(randomize(size));
}
}
return vect;
}
long d1(VectTy vect) {
long sum = 0;
for (auto& v : vect) sum += v;
return sum;
}
long d2(VectTy& vect) {
long sum = 0;
for (auto& v : vect) sum += v;
return sum;
}
long d3(VectTy&& vect) {
long sum = 0;
for (auto& v : vect) sum += v;
return sum;
}
int main(void) {
{
auto start = std::chrono::steady_clock::now();
long total = 0;
for (size_t i = 0; i < NUM; ++i) {
total += d1(factory(MAX, i % 2)); // T1
}
auto end = std::chrono::steady_clock::now();
std::cout << total << std::endl;
auto elapsed = DurationTy(end - start);
std::cerr << elapsed.count() << std::endl;
}
{
auto start = std::chrono::steady_clock::now();
long total = 0;
for (size_t i = 0; i < NUM; ++i) {
VectTy vect = factory(MAX, i % 2); // T2
total += d1(vect);
}
auto end = std::chrono::steady_clock::now();
std::cout << total << std::endl;
auto elapsed = DurationTy(end - start);
std::cerr << elapsed.count() << std::endl;
}
{
auto start = std::chrono::steady_clock::now();
long total = 0;
for (size_t i = 0; i < NUM; ++i) {
VectTy vect = factory(MAX, i % 2); // T3
total += d2(vect);
}
auto end = std::chrono::steady_clock::now();
std::cout << total << std::endl;
auto elapsed = DurationTy(end - start);
std::cerr << elapsed.count() << std::endl;
}
{
auto start = std::chrono::steady_clock::now();
long total = 0;
for (size_t i = 0; i < NUM; ++i) {
total += d3(factory(MAX, i % 2)); // T4
}
auto end = std::chrono::steady_clock::now();
std::cout << total << std::endl;
auto elapsed = DurationTy(end - start);
std::cerr << elapsed.count() << std::endl;
}
return 0;
}
I tested it on both gcc (4.9.2) and clang (trunk) with the -std=c++11 option.
However, I found that only when compiling with clang does T2 take more time (for one run, in milliseconds: 755, 924, 752, 750). I also compiled a version with -fno-elide-constructors, but with similar results.
(Update: there are slight performance differences for T1, T3, and T4 when compiled with Clang (trunk) on Mac OS X.)
My questions:
What are the optimizations applied that bridge the potential performance gaps between T1, T2, T3 in theory? (You can see that I also tried to avoid RVO in factory.)
What is the possible optimization applied for T2 by gcc in this case?
This is because of rvalue references. You are passing std::vector by value - the compiler figures out that it has a move constructor and optimizes the copy into a move.
See following link for details about rvalue refs: http://thbecker.net/articles/rvalue_references/section_01.html
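A small way to observe this yourself (an illustrative sketch, not from the original answer) is a type that logs its copies and moves; passing a temporary vector by value triggers no element copies, while passing a named vector by value copies every element:
#include <iostream>
#include <vector>
struct Probe {
    Probe() = default;
    Probe(const Probe&) { std::cout << "copy\n"; }
    Probe(Probe&&) noexcept { std::cout << "move\n"; }
};
void by_value(std::vector<Probe> v) { (void)v; }
std::vector<Probe> make() { return std::vector<Probe>(3); }
int main() {
    by_value(make());               // temporary: the vector is moved (or elided) - nothing printed
    std::vector<Probe> named = make();
    by_value(named);                // named lvalue: the vector is copied - prints "copy" three times
}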
Update:
The following three methods turn out to be equivalent.
Here, you are passing the return value of factory directly to d1. The compiler knows that the returned value is a temporary and that std::vector (VectTy) has a move constructor defined, so it just calls that move constructor (which makes this call equivalent to d3):
long d1(VectTy vect) {
long sum = 0;
for (auto& v : vect) sum += v;
return sum;
}
Here you are passing by reference, so there is no copy. OTOH, this shouldn't have compiled if you had passed the temporary returned by factory to it directly, unless you are using MSVC - in that case you should disable language extensions:
long d2(VectTy& vect) {
long sum = 0;
for (auto& v : vect) sum += v;
return sum;
}
Of course there won't be any copy here; you are moving the temporary vector (an rvalue) returned by factory into d3:
long d3(VectTy&& vect) {
long sum = 0;
for (auto& v : vect) sum += v;
return sum;
}
If you want to reproduce the copying performance issue, try rolling your own vector class:
#include <vector>
template<class T>
class MyVector
{
private:
std::vector<T> _vec;
public:
MyVector() : _vec()
{}
// Copy-only: declaring the copy constructor and copy assignment suppresses the
// implicit move operations, so passing MyVector by value always makes a deep copy.
MyVector(const MyVector& other) : _vec(other._vec)
{}
MyVector& operator=(const MyVector& other)
{
if(this != &other)
this->_vec = other._vec;
return *this;
}
void push_back(T t)
{
this->_vec.push_back(t);
}
};
Use this instead of std::vector and you will for sure hit the performance problem you are looking for.
An rvalue of type vector<T> will have its guts stolen by another vector<T> if you try to construct the second vector<T> from it. If you assign, it may have its guts stolen, or its contents may be moved, or something else (it is underspecified in the standard).
Constructing from an identical rvalue type is called move construction. For a vector (in most implementations), it consists of reading 3 pointers, writing 3 pointers, and clearing 3 pointers. This is a cheap operation, regardless of how much data the vector holds.
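As a rough illustration of those "3 pointers" (a simplified sketch, not the actual library code):
template<class T>
struct toy_vector {
    T* begin_ = nullptr;
    T* end_   = nullptr;
    T* cap_   = nullptr;
    toy_vector() = default;   // an empty vector
    toy_vector(toy_vector&& other) noexcept
        : begin_(other.begin_), end_(other.end_), cap_(other.cap_)   // read 3, write 3
    {
        other.begin_ = other.end_ = other.cap_ = nullptr;            // clear 3: the source is left empty
    }
};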
Nothing in factory stops NRVO (a kind of elision). Regardless, when you return a local variable (one that, in C++11, exactly matches the return type, or, in C++14, for which a compatible rvalue constructor is found), it is implicitly treated as an rvalue if elision does not occur. So the local vector in factory will either be elided into the return value or have its guts moved. The difference in cost is trivial, and any difference could then be optimized away anyhow.
Your three functions d1, d2 and d3 would be better called "by-value", "by-lvalue" and "by-rvalue".
The call at T1 has the return value elided into the argument of d1. If this elision fails (say you block it), it becomes a move construction, which is only trivially more expensive.
The call at T2 forces a copy.
The call at T3 involves no copy, and neither does T4.
Now, under the as-if rule, the compiler can even skip copies if it can prove that doing so has no observable side effect (or, more accurately, if eliminating the copy is a valid variant of what could happen under the standard). gcc may be doing that, which would be a possible explanation for why T2 is no slower.
The problem with benchmarks of pointless tasks is that, under as-if, once the compiler can prove the task was pointless, it can eliminate it.
But I'm not surprised that T1, T3 and T4 are identical, as the standard mandates that they be basically identical in cost, up to a few pointer shuffles.