Why are unordered_map and map giving the same performance? - C++

Here is my code. My unordered_map and map are behaving the same and taking the same time to execute. Am I missing something about these data structures?
Update: I've changed my code based on the answers and comments below. I've removed the string operations to reduce their impact on the profile, and I now measure only find(), which takes almost 40% of the CPU time in my real code. The profile shows that unordered_map is 3 times faster; is there any other way to make this code faster?
#include <map>
#include <unordered_map>
#include <stdio.h>
#include <stdlib.h> // rand
#include <time.h>   // clock

struct Property {
    int a;
};

int main() {
    printf("Performance Summary:\n");
    static const unsigned long num_iter = 999999;

    std::unordered_map<int, Property> myumap;
    for (int i = 0; i < 10000; i++) {
        int ind = rand() % 1000;
        Property p;
        p.a = i;
        myumap.insert(std::pair<int, Property>(ind, p));
    }

    clock_t tStart = clock();
    for (unsigned long i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        std::unordered_map<int, Property>::iterator itr = myumap.find(ind);
    }
    printf("Time taken unordered_map: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);

    std::map<int, Property> mymap;
    for (int i = 0; i < 10000; i++) {
        int ind = rand() % 1000;
        Property p;
        p.a = i;
        mymap.insert(std::pair<int, Property>(ind, p));
    }

    tStart = clock();
    for (unsigned long i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        std::map<int, Property>::iterator itr = mymap.find(ind);
    }
    printf("Time taken map: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
}
The output is:
Performance Summary:
Time taken unordered_map: 0.12s
Time taken map: 0.36s

Without going into your code, I would make a few general comments.
What exactly are you measuring? Your profiling includes both populating and scanning the data structures. Given that (presumably) populating an ordered map takes longer, measuring both phases together obscures any gains (or otherwise) on the lookup side. Figure out what you want to measure, and measure just that.
You also have a lot going on in the code that is probably incidental to what you are profiling: a lot of object creation, string concatenation, and so on. That is probably what you are actually measuring. Focus on profiling only what you want to measure (see point 1).
10,000 cases is way too small. At this scale other considerations can overwhelm what you are measuring, particularly when you are measuring everything.

There is a reason we like getting minimal, complete and verifiable examples. Here's my code:
#include <map>
#include <unordered_map>
#include <stdio.h>
#include <stdlib.h> // rand
#include <time.h>   // clock

struct Property {
    int a;
};

static const unsigned long num_iter = 100000;

int main() {
    printf("Performance Summary:\n");

    clock_t tStart = clock();
    std::unordered_map<int, Property> myumap;
    for (unsigned long i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        Property p;
        //p.fileName = "hello" + to_string(i) + "world!";
        p.a = i;
        myumap.insert(std::pair<int, Property>(ind, p));
    }
    for (unsigned long i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        myumap.find(ind);
    }
    printf("Time taken unordered_map: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);

    tStart = clock();
    std::map<int, Property> mymap;
    for (unsigned long i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        Property p;
        //p.fileName = "hello" + to_string(i) + "world!";
        p.a = i;
        mymap.insert(std::pair<int, Property>(ind, p));
    }
    for (unsigned long i = 0; i < num_iter; i++) {
        int ind = rand() % 1000;
        mymap.find(ind);
    }
    printf("Time taken map: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
}
Run time is:
Performance Summary:
Time taken unordered_map: 0.04s
Time taken map: 0.07s
Please note that I am running ten times the number of iterations you were running.
I suspect there are two problems with your version. The first is that you are running too few iterations for the difference to show. The second is that you are doing expensive string operations inside the timed loop. The time spent on the string operations dwarfs the time saved by using the unordered map, which is why you are not seeing a difference in performance.

Whether a tree (std::map) or a hash map (std::unordered_map) is faster really depends on the number of entries and the characteristics of the key (the variability of the values, the compare and hashing functions, etc.)
But in theory, a tree is slower than a hash map because insertion and searching inside a binary tree is O(log2(N)) complexity while insertion and searching inside a hash map is roughly O(1) complexity.
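To put rough numbers on that: with N = 10,000 entries a balanced tree does about log2(10000) ≈ 13 key comparisons per lookup, while a hash map with a reasonable hash function and load factor does one hash computation plus a probe or two, independent of N.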
Your test didn't show it because:
You call rand() in a loop. That takes ages in comparison with the map insertion. And it generates different values for the two maps you're testing, skewing results even further. Use a lighter-weight generator e.g. a minstd LCG.
You need a higher resolution clock and more iterations so that each test run takes at least a few hundred milliseconds.
You need to make sure the compiler does not reorder your code so the timing calls happen where they should. This is not always easy. A memory fence around the timed test usually helps to solve this.
Your find() calls have a high probability of being optimized away, since you're not using their value. (I happen to know that at least GCC in -O2 mode doesn't do that, so I leave it as is; a defensive guard is sketched after this list.)
String concatenation is also very slow in comparison.
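If you want to be safe across compilers, one option (my addition, not part of the measured code below) is to consume each lookup's result so the calls have an observable effect, using the same names as the updated version below:
size_t found = 0;
for (int i = 0; i < nIter; i++) {
    int ind = testDist(rnd);
    found += (mymap.find(ind) != mymap.end()); // the lookup result now matters
}
printf("(found %zu)\n", found); // print the sink so it stays live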
Here's my updated version:
#include <atomic>
#include <chrono>
#include <cstdio>   // printf
#include <iostream>
#include <map>
#include <random>
#include <string>
#include <unordered_map>

using namespace std;
using namespace std::chrono;

struct Property {
    string fileName;
};

const int nIter = 1000000;

template<typename MAP_TYPE>
long testMap() {
    std::minstd_rand rnd(12345);
    std::uniform_int_distribution<int> testDist(0, 1000);

    auto tm1 = high_resolution_clock::now();
    atomic_thread_fence(memory_order_seq_cst);

    MAP_TYPE mymap;
    for (int i = 0; i < nIter; i++) {
        int ind = testDist(rnd);
        Property p;
        p.fileName = "hello" + to_string(i) + "world!";
        mymap.insert(pair<int, Property>(ind, p));
    }

    atomic_thread_fence(memory_order_seq_cst);

    for (int i = 0; i < nIter; i++) {
        int ind = testDist(rnd);
        mymap.find(ind);
    }

    atomic_thread_fence(memory_order_seq_cst);
    auto tm2 = high_resolution_clock::now();

    return (long)duration_cast<milliseconds>(tm2 - tm1).count();
}

int main()
{
    printf("Performance Summary:\n");
    printf("Time taken unordered_map: %ldms\n", testMap<unordered_map<int, Property>>());
    printf("Time taken map: %ldms\n", testMap<map<int, Property>>());
}
Compiled with -O2, it gives the following results:
Performance Summary:
Time taken unordered_map: 348ms
Time taken map: 450ms
So using unordered_map in this particular case is faster by ~20-25%.

It's not just the lookup that's faster with an unordered_map. This slightly modified test also compares the fill times.
I have made a couple of modifications:
increased sample size
both maps now use the same sequence of random numbers.
#include <algorithm>  // std::generate_n
#include <iterator>   // std::back_inserter
#include <map>
#include <unordered_map>
#include <vector>
#include <stdio.h>
#include <stdlib.h>   // rand
#include <time.h>     // clock

struct Property {
    int a;
};

// Adapter iterator: walks a vector<int> of keys but yields
// pair<const int, Property> values, so a range of keys can be fed
// straight into map::insert(first, last).
struct make_property : std::vector<int>::const_iterator
{
    using base_class = std::vector<int>::const_iterator;
    using value_type = std::pair<const base_class::value_type, Property>;
    using base_class::base_class;

    decltype(auto) get() const {
        return base_class::operator*();
    }

    value_type operator*() const
    {
        return std::pair<const int, Property>(get(), Property());
    }
};

int main() {
    printf("Performance Summary:\n");
    static const unsigned long num_iter = 9999999;

    std::vector<int> keys;
    keys.reserve(num_iter);
    std::generate_n(std::back_inserter(keys), num_iter, [](){ return rand() / 10000; });

    auto time = [](const char* message, auto&& func)
    {
        clock_t tStart = clock();
        func();
        clock_t tEnd = clock();
        printf("%s: %.2gs\n", message, double(tEnd - tStart) / CLOCKS_PER_SEC);
    };

    std::unordered_map<int, Property> myumap;
    time("fill unordered map", [&]
    {
        myumap.insert(make_property(keys.cbegin()),
                      make_property(keys.cend()));
    });

    std::map<int, Property> mymap;
    time("fill ordered map", [&]
    {
        mymap.insert(make_property(keys.cbegin()),
                     make_property(keys.cend()));
    });

    time("find in unordered map", [&]
    {
        for (auto k : keys) { myumap.find(k); }
    });

    time("find in ordered map", [&]
    {
        for (auto k : keys) { mymap.find(k); }
    });
}
example output:
Performance Summary:
fill unordered map: 3.5s
fill ordered map: 7.1s
find in unordered map: 1.7s
find in ordered map: 5s

Related

Is accessing container element time-consuming?

I want to compute the GCD of pairs of integers and save the counts. I find that the time-consuming part is not calculating the GCD but saving the results to the map. Am I using std::map in a bad way?
#include <map>
#include <iostream>
#include <chrono>
#include "timer.h"

using namespace std;

int gcd (int a, int b)
{
    int temp;
    while (b != 0)
    {
        temp = a % b;
        a = b;
        b = temp;
    }
    return a;
}

int main() {
    map<int,int> res;
    {
        Timer timer;
        for(int i = 1; i < 10000; i++)
        {
            for(int j = 2; j < 10000; j++)
                res[gcd(i,j)]++;
        }
    }
    {
        Timer timer;
        for(int i = 1; i < 10000; i++)
        {
            for(int j = 2; j < 10000; j++)
                gcd(i, j);
        }
    }
}
6627099us(6627.1ms)
0us(0ms)
You should use a real benchmarking library to test this kind of code. In your particular case, the second loop, where you discard the result of gcd, was probably optimized away entirely. With quick-bench I see very little difference between running just the algorithm and also storing the results in a std::map or std::unordered_map. I used randomized integers for testing, which is maybe not the best fit for a GCD algorithm, but you can try other approaches.
Code under benchmark without storage:
constexpr int N = 10000;
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> distrib(1, N);
benchmark::DoNotOptimize(gcd(distrib(gen), distrib(gen)));
and with storage:
benchmark::DoNotOptimize(res[gcd(distrib(gen), distrib(gen))]++);
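For context, these snippets sit inside google-benchmark style functions on quick-bench. A sketch of the with-storage case, as I would reconstruct it (not the author's exact harness):
#include <benchmark/benchmark.h>
#include <map>
#include <random>

static int gcd(int a, int b) { while (b != 0) { int t = a % b; a = b; b = t; } return a; }

static void BM_gcd_with_map(benchmark::State& state) {
    constexpr int N = 10000;
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<> distrib(1, N);
    std::map<int, int> res;
    for (auto _ : state) {
        // DoNotOptimize materializes the value so the work cannot be elided.
        benchmark::DoNotOptimize(res[gcd(distrib(gen), distrib(gen))]++);
    }
}
BENCHMARK(BM_gcd_with_map);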
Results: (the quick-bench charts are not reproduced here; the two variants measured about the same.)
You are using std::map correctly. However, you are using an inefficient container for your problem. Given that the possible values of gcd(x,y) are bounded by N, a std::vector would be the most efficient container to store the results.
Specifically,
int main() {
    const int N = 10'000;
    std::vector<int> res(N, 0); // initialize to N elements with value 0.
    ...
}
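A fuller sketch of that idea, filling in the elided loops from the question (safe because gcd(i, j) <= min(i, j) < N, so every result indexes into the vector):
#include <iostream>
#include <vector>

int gcd(int a, int b) { while (b != 0) { int t = a % b; a = b; b = t; } return a; }

int main() {
    const int N = 10'000;
    std::vector<int> res(N, 0);       // res[g] counts how often gcd == g
    for (int i = 1; i < N; i++)
        for (int j = 2; j < N; j++)
            res[gcd(i, j)]++;         // O(1) access, no tree walk, no node allocation
    std::cout << "gcd == 1 occurred " << res[1] << " times\n";
}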
Using parallelism will speed up the program even further. Each thread would have its own std::vector to accumulate local results. Once a thread is finished, its results would be added to the shared result vector in a thread-safe manner (e.g. using std::mutex), as sketched below.
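A minimal sketch of that scheme with std::thread (the strided row split and all names are illustrative, not from the answer):
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

int gcd(int a, int b) { while (b != 0) { int t = a % b; a = b; b = t; } return a; }

int main() {
    const int N = 10'000;
    const unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<int> res(N, 0);
    std::mutex resMutex;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; t++) {
        workers.emplace_back([&, t] {
            std::vector<int> local(N, 0);                        // thread-local counters
            for (int i = 1 + (int)t; i < N; i += (int)nThreads)  // strided share of rows
                for (int j = 2; j < N; j++)
                    local[gcd(i, j)]++;
            std::lock_guard<std::mutex> lock(resMutex);          // merge once per thread
            for (int g = 0; g < N; g++) res[g] += local[g];
        });
    }
    for (auto& w : workers) w.join();
}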

Getting speed improvement with OpenMP in nested for loops with dependencies

I am trying to parallelize a procedure with OpenMP. It contains four-level nested (dependent) for loops, with a variable sum_p updated in the innermost loop. In short, my question concerns the parallel implementation of the following code snippet:
for (int i = (test_map.size() - 1); i >= 1; --i) {
    bin_i = test_map.at(i);            // test_map is an STL map of vectors
    len_rank_bin_i = bin_i.size();     // bin_i is a vector
    for (int j = (i - 1); j >= 0; --j) {
        bin_j = test_map.at(j);
        len_rank_bin_j = bin_j.size();
        for (int u_i = 0; u_i < len_rank_bin_i; u_i++) {
            node_u = bin_i[u_i];       // node_u is a scalar
            for (int v_i = 0; v_i < len_rank_bin_j; v_i++) {
                node_v = bin_j[v_i];
                if (node_u > node_v)
                    sum_p += 1;
            }
        }
    }
}
The full program is given below:
#include <iostream>
#include <vector>
#include <omp.h>
#include <random>
#include <unordered_map>
#include <algorithm>
#include <functional>
#include <stdio.h>
#include <time.h>

int main(int argc, char* argv[]){
    double time_temp;
    int test_map_size = 5000;
    std::unordered_map<unsigned int, std::vector<unsigned int> > test_map(test_map_size);

    // Fill the test map with random integers --------------------------------
    std::random_device rd;
    std::mt19937 gen1(rd());
    std::uniform_int_distribution<int> dist(1, 5);
    auto gen = std::bind(dist, gen1);
    for(int i = 0; i < test_map_size; i++)
    {
        int vector_len = dist(gen1);
        std::vector<unsigned int> tt(vector_len);
        std::generate(begin(tt), end(tt), gen);
        test_map.insert({i, tt});
    }

    // Sequential implementation ----------------------------------------------
    time_temp = omp_get_wtime();
    std::vector<unsigned int> bin_i, bin_j;
    unsigned int node_v, node_u;
    unsigned int len_rank_bin_i;
    unsigned int len_rank_bin_j;
    int sum_s = 0;
    for (unsigned int i = (test_map_size - 1); i >= 1; --i) {
        bin_i = test_map.at(i);
        len_rank_bin_i = bin_i.size();
        for (unsigned int j = i; j-- > 0; ) {
            bin_j = test_map.at(j);
            len_rank_bin_j = bin_j.size();
            for (unsigned int u_i = 0; u_i < len_rank_bin_i; u_i++) {
                node_u = bin_i[u_i];
                for (unsigned int v_i = 0; v_i < len_rank_bin_j; v_i++) {
                    node_v = bin_j[v_i];
                    if (node_u > node_v)
                        sum_s += 1;
                }
            }
        }
    }
    std::cout << "Estimated sum (seq): " << sum_s << std::endl;
    time_temp = omp_get_wtime() - time_temp;
    printf("Time taken for sequential implementation: %.2fs\n", time_temp);

    // Parallel implementation -------------------------------------------------
    time_temp = omp_get_wtime();
    int sum_p = 0;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        std::vector<unsigned int> bin_i, bin_j;
        unsigned int node_v, node_u;
        unsigned int len_rank_bin_i;
        unsigned int len_rank_bin_j;
        unsigned int i, u_i, v_i;
        int j;
        #pragma omp parallel for private(j,u_i,v_i) reduction(+:sum_p)
        for (i = (test_map_size - 1); i >= 1; --i) {
            bin_i = test_map.at(i);
            len_rank_bin_i = bin_i.size();
            #pragma omp parallel for private(u_i,v_i)
            for (j = (i - 1); j >= 0; --j) {
                bin_j = test_map.at(j);
                len_rank_bin_j = bin_j.size();
                #pragma omp parallel for private(v_i)
                for (u_i = 0; u_i < len_rank_bin_i; u_i++) {
                    node_u = bin_i[u_i];
                    #pragma omp parallel for
                    for (v_i = 0; v_i < len_rank_bin_j; v_i++) {
                        node_v = bin_j[v_i];
                        if (node_u > node_v)
                            sum_p += 1;
                    }
                }
            }
        }
    }
    std::cout << "Estimated sum (parallel): " << sum_p << std::endl;
    time_temp = omp_get_wtime() - time_temp;
    printf("Time taken for parallel implementation: %.2fs\n", time_temp);
    return 0;
}
Running the code with the command g++-7 -fopenmp -std=c++11 -O3 -Wall -o so_qn so_qn.cpp on macOS 10.13.3 (an i5 processor with four logical cores) gives the following output:
Estimated sum (seq): 38445750
Time taken for sequential implementation: 0.49s
Estimated sum (parallel): 38445750
Time taken for parallel implementation: 50.54s
The time taken by the parallel implementation is several times higher than that of the sequential implementation. Do you think the code or logic can be restructured into a proper parallel implementation? I have spent a few days trying to improve the terrible performance of my code, to no avail. Any help is greatly appreciated.
Update
With the changes suggested by JimCownie, i.e., using omp for instead of omp parallel for and removing the parallelism of the inner loops (see the sketch after the timings below), the performance is greatly improved.
Estimated sum (seq): 42392944
Time taken for sequential implementation: 0.48s
Estimated sum (parallel): 42392944
Time taken for parallel implementation: 0.27s
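For reference, a minimal sketch of what the corrected region might look like (my reconstruction of the suggestion, using the variables from the program above):
#pragma omp parallel
{
    std::vector<unsigned int> bin_i, bin_j;  // per-thread buffers
    // Split only the outer loop across the team; dynamic schedule because
    // iteration i does work proportional to i (triangular workload).
    #pragma omp for reduction(+:sum_p) schedule(dynamic)
    for (int i = test_map_size - 1; i >= 1; --i) {
        bin_i = test_map.at(i);
        for (int j = i - 1; j >= 0; --j) {   // inner loops stay sequential
            bin_j = test_map.at(j);
            for (unsigned int u_i = 0; u_i < bin_i.size(); u_i++)
                for (unsigned int v_i = 0; v_i < bin_j.size(); v_i++)
                    if (bin_i[u_i] > bin_j[v_i])
                        sum_p += 1;
        }
    }
}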
My CPU has four logical cores (and I am using four threads); now I am wondering whether there is any way to get performance four times better than the sequential implementation.
I see a different problem when my map of vectors test_map is short but fat, i.e., the map size is small but the vector at each key is very large. In that case the performance of the sequential and parallel implementations is comparable, without much difference. It seems we would need to parallelize the inner loops too. Do you know how to achieve that in this context?

Use std::chrono::high_resolution_clock to measure std::lower_bound execution time?

I used std::chrono::high_resolution_clock to measure std::lower_bound execution time. Here is my test code:
#include <iostream>
#include <algorithm>
#include <chrono>
#include <random>
#include <vector>

const long SIZE = 1000000;

using namespace std::chrono;
using namespace std;

int sum_stl(const std::vector<double>& array, const std::vector<double>& vals)
{
    long temp = 0;
    auto t0 = high_resolution_clock::now();
    for(const auto& val : vals) {
        temp += lower_bound(array.begin(), array.end(), val) - array.begin();
    }
    auto t1 = high_resolution_clock::now();
    cout << duration_cast<duration<double>>(t1 - t0).count() / vals.size()
         << endl;
    return temp;
}

int main() {
    const int N = 1000;
    vector<double> array(N);
    auto&& seed = high_resolution_clock::now().time_since_epoch().count();
    mt19937 rng(move(seed));
    uniform_real_distribution<float> r_dist(0.f, 1.f);
    generate(array.begin(), array.end(), [&](){ return r_dist(rng); });
    sort(array.begin(), array.end());

    vector<double> vals;
    for(int i = 0; i < SIZE; ++i) {
        vals.push_back(r_dist(rng));
    }

    int index = sum_stl(array, vals);
    return 0;
}
array is a sorted vector of 1000 uniformly distributed random numbers, and vals holds 1 million values. At first I set the timer inside the loop to measure every single std::lower_bound execution; the timing result was around 1.4e-7 seconds. Then I tested other operations like +, -, sqrt and exp, but they all gave the same result as std::lower_bound.
In a former topic, "resolution of std::chrono::high_resolution_clock doesn't correspond to measurements", it is said that the chrono resolution might not be enough to represent a duration of less than 100 nanoseconds. So I set one timer around the whole loop and computed an average by dividing by the iteration count. Here is the output:
1.343e-14
There must be something wrong, since it reports a duration shorter than a single CPU cycle, but I just can't figure out what.
To make the question more general, how can I measure accurate execution time for a short function?
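One approach, consistent with the advice elsewhere on this page, is to keep the whole-loop timing but give the loop an observable result so the compiler cannot discard or precompute the work. A minimal sketch, assuming only the standard library:
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Returns average seconds per lower_bound call; printing the sink gives
// the searches an observable effect so they cannot be optimized away.
double time_lower_bound(const std::vector<double>& array,
                        const std::vector<double>& vals)
{
    using namespace std::chrono;
    long sink = 0;
    auto t0 = steady_clock::now();
    for (const auto& v : vals)
        sink += std::lower_bound(array.begin(), array.end(), v) - array.begin();
    auto t1 = steady_clock::now();
    std::printf("sink = %ld\n", sink);
    return duration<double>(t1 - t0).count() / vals.size();
}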

C++ multithreading with openMP: poor performance despite localized variables (false sharing?)

I have run into a fairly weird OpenMP problem.
The task is to take a vector of strings and split each element into its contained k-mers (all contained substrings of length k). This should parallelize trivially along the elements of the vector, as the k-merification happens independently for each element. I want to store the results in a nested map STL data structure (std::map<long long, std::map<std::string, std::map<unsigned int, int> > > local_forReturn in the code below), and I allocate a thread-local variable for that.
The achieved parallelization, however, is surprisingly bad: top on Linux shows CPU usage of ~200%, despite running with 40 threads on a 40-core machine. (And I have verified that the #pragma omp critical section is not the bottleneck.)
My hunch is that this might be related to false sharing, as the actual data contained in my thread-local map objects ends up on the heap. However, I have no idea how to test this intuition, nor how to reduce false sharing for STL constructs (if that is indeed the problem). I would greatly appreciate any ideas!
Complete code:
#include <string>
#include <assert.h>
#include <stdlib.h>   // rand
#include <set>
#include <map>
#include <vector>
#include <omp.h>
#include <iostream>

int threads = 40;
int k = 31;

std::string generateRandomSequence(int length);
char randomNucleotide();
std::vector<std::string> partitionStringIntokMers(std::string str, int k);

int main(int argc, char *argv[])
{
    // generate test data
    std::vector<std::string> requiredSEQ;
    for(unsigned int i = 0; i < 10000; i++)
    {
        std::string seq = generateRandomSequence(20000);
        requiredSEQ.push_back(seq);
    }

    // this variable will contain the final result
    std::map<long long, std::map<std::string, std::map<unsigned int, int> > > forReturn;

    omp_set_num_threads(threads);
    std::cerr << "Data generated, now start parallel processing\n" << std::flush;

    // split workload (ie requiredSEQ) according to number of threads
    long long max_i = requiredSEQ.size() - 1;
    long long chunk_size = max_i / threads;

    #pragma omp parallel
    {
        assert(omp_get_num_threads() == threads);
        long long thisThread = omp_get_thread_num();
        long long firstPair = thisThread * chunk_size;
        long long lastPair = (thisThread+1) * chunk_size - 1;
        if((thisThread == (threads-1)) && (lastPair < max_i))
        {
            lastPair = max_i;
        }

        std::map<long long, std::map<std::string, std::map<unsigned int, int> > > local_forReturn;
        for(long long seqI = firstPair; seqI <= lastPair; seqI++)
        {
            const std::string& SEQ_sequence = requiredSEQ.at(seqI);
            const std::vector<std::string> kMersInSegment = partitionStringIntokMers(SEQ_sequence, k);
            for(unsigned int kMerI = 0; kMerI < kMersInSegment.size(); kMerI++)
            {
                const std::string& kMerSeq = kMersInSegment.at(kMerI);
                local_forReturn[seqI][kMerSeq][kMerI]++;
            }
        }

        #pragma omp critical
        {
            forReturn.insert(local_forReturn.begin(), local_forReturn.end());
        }
    }

    return 0;
}

std::string generateRandomSequence(int length)
{
    std::string forReturn;
    forReturn.resize(length);
    for(int i = 0; i < length; i++)
    {
        forReturn.at(i) = randomNucleotide();
    }
    return forReturn;
}

char randomNucleotide()
{
    char nucleotides[4] = {'A', 'C', 'G', 'T'};
    int n = rand() % 4;
    assert((n >= 0) && (n <= 3));
    return nucleotides[n];
}

std::vector<std::string> partitionStringIntokMers(std::string str, int k)
{
    std::vector<std::string> forReturn;
    if((int)str.length() >= k)
    {
        forReturn.resize((str.length() - k) + 1);
        for(int i = 0; i <= (int)(str.length() - k); i++)
        {
            std::string kMer = str.substr(i, k);
            assert((int)kMer.length() == k);
            forReturn.at(i) = kMer;
        }
    }
    return forReturn;
}

Comprehensive vector vs linked list benchmark for randomized insertions/deletions

So I am aware of this question, and others on SO that deal with this issue, but most of those deal with the theoretical complexities of the data structures (just to recap here: a linked list theoretically has O(1) insertion and deletion at a known position, while a vector has O(n)).
I understand the complexities would seem to indicate that a list would be better, but I am more concerned with the real-world performance.
Note: This question was inspired by slides 45 and 46 of Bjarne Stroustrup's presentation at Going Native 2012 where he talks about how processor caching and locality of reference really help with vectors, but not at all (or enough) with lists.
Question: Is there a good way to test this using CPU time as opposed to wall time, and a decent way of generating the "random" insertions and deletions beforehand, so that the generation does not influence the timings?
As a bonus, it would be nice to be able to apply this to two arbitrary data structures (say vector and hash maps or something like that) to find the "real world performance" on some hardware.
I guess if I were going to test something like this, I'd probably start with code something on this order:
#include <list>
#include <vector>
#include <algorithm>
#include <deque>
#include <stdlib.h>   // srand, rand
#include <time.h>
#include <iostream>
#include <iterator>

static const int size = 30000;

template <class T>
double insert(T &container) {
    srand(1234);
    clock_t start = clock();

    for (int i = 0; i < size; ++i) {
        int value = rand();
        typename T::iterator pos = std::lower_bound(container.begin(), container.end(), value);
        container.insert(pos, value);
    }
    // uncomment the following to verify correct insertion (in a small container).
    // std::copy(container.begin(), container.end(), std::ostream_iterator<int>(std::cout, "\t"));
    return double(clock() - start) / CLOCKS_PER_SEC;
}

template <class T>
double del(T &container) {
    srand(1234);
    clock_t start = clock();

    for (int i = 0; i < size/2; ++i) {
        int value = rand();
        typename T::iterator pos = std::lower_bound(container.begin(), container.end(), value);
        container.erase(pos);
    }
    return double(clock() - start) / CLOCKS_PER_SEC;
}

int main() {
    std::list<int> l;
    std::vector<int> v;
    std::deque<int> d;

    std::cout << "Insertion time for list: " << insert(l) << "\n";
    std::cout << "Insertion time for vector: " << insert(v) << "\n";
    std::cout << "Insertion time for deque: " << insert(d) << "\n\n";

    std::cout << "Deletion time for list: " << del(l) << '\n';
    std::cout << "Deletion time for vector: " << del(v) << '\n';
    std::cout << "Deletion time for deque: " << del(d) << '\n';
    return 0;
}
Since it uses clock, this should give processor time not wall time (though some compilers such as MS VC++ get that wrong). It doesn't try to measure the time for insertion exclusive of time to find the insertion point, since 1) that would take a bit more work and 2) I still can't figure out what it would accomplish. It's certainly not 100% rigorous, but given the disparity I see from it, I'd be a bit surprised to see a significant difference from more careful testing. For example, with MS VC++, I get:
Insertion time for list: 6.598
Insertion time for vector: 1.377
Insertion time for deque: 1.484
Deletion time for list: 6.348
Deletion time for vector: 0.114
Deletion time for deque: 0.82
With gcc I get:
Insertion time for list: 5.272
Insertion time for vector: 0.125
Insertion time for deque: 0.125
Deletion time for list: 4.259
Deletion time for vector: 0.109
Deletion time for deque: 0.109
Factoring out the search time would be somewhat non-trivial, because you'd have to time each iteration separately. You'd need something more precise than clock usually is to produce meaningful results from that (more on the order of reading a clock-cycle register). Feel free to modify for that if you see fit; as I mentioned above, I lack the motivation because I can't see how it's a sensible thing to do.
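For what it's worth, a sketch of reading such a register on x86 with GCC or Clang (__rdtsc reads the timestamp counter; this is my illustration, not part of the answer):
#include <x86intrin.h>  // __rdtsc on GCC/Clang for x86; MSVC has it in <intrin.h>
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v(30000);
    for (int i = 0; i < 30000; ++i) v[i] = i;

    unsigned long long t0 = __rdtsc();
    auto pos = std::lower_bound(v.begin(), v.end(), 12345);
    unsigned long long ticks = __rdtsc() - t0;

    // Print the result so the search isn't optimized away. A single
    // measurement is noisy; real use would repeat and take the minimum.
    std::printf("found %d in ~%llu ticks\n", *pos, ticks);
}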
This is the program I wrote after watching that talk. I tried running each timing test in a separate process to make sure the allocators weren't doing anything sneaky to alter performance. I have amended the test to allow timing of the random number generation. If you are concerned it affects the results significantly, you can time it and subtract the time spent there from the other timings, but I get zero time spent there for anything but very large N. I used getrusage(), which I am pretty sure isn't portable to Windows, but it would be easy to substitute something using clock() or whatever you like.
#include <assert.h>
#include <algorithm>
#include <iostream>
#include <list>
#include <string>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

// Sorted insertion into a vector.
void f(size_t const N)
{
    std::vector<int> c;
    //c.reserve(N);
    for (size_t i = 0; i < N; ++i) {
        int r = rand();
        auto p = std::find_if(c.begin(), c.end(), [=](int a) { return a >= r; });
        c.insert(p, r);
    }
}

// Sorted insertion into a list.
void g(size_t const N)
{
    std::list<int> c;
    for (size_t i = 0; i < N; ++i) {
        int r = rand();
        auto p = std::find_if(c.begin(), c.end(), [=](int a) { return a >= r; });
        c.insert(p, r);
    }
}

// Random number generation alone, to measure its overhead.
int h(size_t const N)
{
    int r = 0;
    for (size_t i = 0; i < N; ++i) {
        r = rand();
    }
    return r;
}

// User + system CPU time consumed by this process, in seconds.
double usage()
{
    struct rusage u;
    if (getrusage(RUSAGE_SELF, &u) == -1) std::abort();
    return
        double(u.ru_utime.tv_sec) + (u.ru_utime.tv_usec / 1e6) +
        double(u.ru_stime.tv_sec) + (u.ru_stime.tv_usec / 1e6);
}

int main(int argc, char* argv[])
{
    assert(argc >= 3);
    std::string const sel = argv[1];
    size_t const N = atoi(argv[2]);
    double t0, t1;
    srand(127);

    if (sel == "vector") {
        t0 = usage();
        f(N);
        t1 = usage();
    } else if (sel == "list") {
        t0 = usage();
        g(N);
        t1 = usage();
    } else if (sel == "rand") {
        t0 = usage();
        h(N);
        t1 = usage();
    } else {
        std::abort();
    }

    std::cout
        << (t1 - t0)
        << std::endl;
    return 0;
}
To get a set of results I used the following shell script.
seq=`perl -e 'for ($i = 10; $i < 100000; $i *= 1.1) { print int($i), " "; }'`
for i in $seq; do
vt=`./a.out vector $i`
lt=`./a.out list $i`
echo $i $vt $lt
done