I read this nice experiment comparing, in particular, the performance of calling insert() on both a vector and a deque container. The result from that particular experiment (Experiment 4) was that deque is vastly superior for this operation.
I implemented my own test using a short sorting function I wrote, which I should note uses the [] operator along with other member functions, and found vastly different results. For example, for inserting 100,000 elements, vector took 24.88 seconds, while deque took 374.35 seconds.
How can I explain this? I imagine it has something to do with my sorting function, but would like the details!
I'm using g++ 4.6 with no optimizations.
Here's the program:
#include <iostream>
#include <vector>
#include <deque>
#include <cstdlib>
#include <ctime>
using namespace std;
size_t InsertionIndex(vector<double>& vec, double toInsert) {
for (size_t i = 0; i < vec.size(); ++i)
if (toInsert < vec[i])
return i;
return vec.size(); // return last index+1 if toInsert is largest yet
}
size_t InsertionIndex(deque<double>& deq, double toInsert) {
for (size_t i = 0; i < deq.size(); ++i)
if (toInsert < deq[i])
return i;
return deq.size(); // return last index+1 if toInsert is largest yet
}
int main() {
vector<double> vec;
deque<double> deq;
size_t N = 100000;
clock_t tic = clock();
for(int i = 0; i < N; ++i) {
double val = rand();
vec.insert(vec.begin() + InsertionIndex(vec, val), val);
// deq.insert(deq.begin() + InsertionIndex(deq, val), val);
}
float total = (float)(clock() - tic) / CLOCKS_PER_SEC;
cout << total << endl;
}
The special case where deque can be much faster than vector is when you're inserting at the front of the container. In your case you're inserting at random locations, which will actually give the advantage to vector.
Also unless you're using an optimized build, it's quite possible that there are bounds checks in the library implementation. Those checks can add significantly to the time. To do a proper benchmark comparison you must run with all normal optimizations turned on and debug turned off.
Your code is performing an insertion sort, which is O(n^2). Iterating over a deque is slower than iterating over a vector.
I suspect the reason you are not seeing the same result as the posted link is that the run-time of your program is dominated by the loop in InsertionIndex, not by the call to deque::insert (or vector::insert).
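To make such a test measure the insertion rather than the search, the linear scan can be replaced by a binary search. The following is only a minimal sketch under the assumption that the container is kept sorted at all times; SortedInsert is a hypothetical helper name, and std::upper_bound stands in for the question's InsertionIndex (both yield the first position whose element is greater than the new value).

```cpp
#include <algorithm>
#include <deque>
#include <vector>

// Inserts `value` into an already sorted container and keeps it sorted.
// std::upper_bound returns the first position whose element is greater
// than `value`, the same position InsertionIndex computes, but with
// O(log n) comparisons on random-access iterators.
template <typename Container>
void SortedInsert(Container& c, const typename Container::value_type& value) {
    typename Container::iterator pos =
        std::upper_bound(c.begin(), c.end(), value);
    c.insert(pos, value);
}
```

The same template accepts both std::vector<double> and std::deque<double>, so the two timing runs differ only in the container type.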
I attempted to make my own sorting algorithm (call it MySort for now) and benchmark it against the sorting times of QuickSort. I use a random number generator to make an input file containing n random numbers, then provide this file as input to both MySort and QuickSort, and use std::chrono to time the time they take individually.
(At first I used an online compiler to check the times, but when I hit the limit of 10000 characters as input, I switched to doing it myself on my PC.)
So, for the first few tries (100 elements, 1000 elements, 10000 elements, 100000 elements), everything is working fine. I am getting a proper output time for the amount of time each sorting algorithm takes, but when I try to use 1000000 elements, QuickSort just doesn't give any output (does not seem to work at all), which is strange, because MySort worked just fine. I don't think it is a space issue, since MySort uses 2n additional space and works just fine.
The implementation of QuickSort I am using is given below:
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
void quick_sort(int[],int,int);
int partition(int[],int,int);
int main()
{
int n,i;
cin>>n;
int a[n];
for(i=0;i<n;i++)
cin>>a[i];
auto start = high_resolution_clock::now();
quick_sort(a,0,n-1);
auto stop = high_resolution_clock::now();
duration <double, micro> d = stop - start;
cout<<"Time taken = "<<d.count()<<endl;
/*
cout<<"\nArray after sorting:";
for(i=0;i<n;i++)
cout<<a[i]<<endl;
*/
return 0;
}
void quick_sort(int a[],int l,int u)
{
int j;
if(l<u)
{
j=partition(a,l,u);
quick_sort(a,l,j-1);
quick_sort(a,j+1,u);
}
}
int partition(int a[],int l,int u)
{
int v,i,j,temp;
v=a[l];
i=l;
j=u+1;
do
{
do
i++;
while(a[i]<v&&i<=u);
do
j--;
while(v<a[j]);
if(i<j)
{
temp=a[i];
a[i]=a[j];
a[j]=temp;
}
}while(i<j);
a[l]=a[j];
a[j]=v;
return(j);
}
I tried looking around for solutions as to why it refuses to work for a million elements, but found nothing, besides the possibility that it may be a space issue, which seems unlikely to me considering MySort is working.
As for what exactly I get as output on feeding 1000000 elements in, when I execute both files on the command line, the output I get is (both run twice):
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 512129
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 516131
C:\Users\Zac\Desktop>QuickSortTest <output.txt
C:\Users\Zac\Desktop>QuickSortTest <output.txt
C:\Users\Zac\Desktop>
However, if I run them both for only 100000 elements each, this is what I get:
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 76897.1
C:\Users\Zac\Desktop>MySortTest <output.txt
Time Taken = 74019.4
C:\Users\Zac\Desktop>QuickSortTest <output.txt
Time taken = 16880.2
C:\Users\Zac\Desktop>QuickSortTest <output.txt
Time taken = 18005.3
C:\Users\Zac\Desktop>
Seems to be working fine.
I am at my wits' end; any suggestions would be wonderful.
cin>>n;
int a[n];
This is your bug. You should never do this for three reasons.
This is not valid C++. In C++, the dimension of any array should be a constant expression. You are fooled by a non-conformant extension of gcc. Your code will fail to compile with other compilers. You should always use gcc (and clang) in high conformance mode. For C++, it would be g++ -std=c++17 -Wall -pedantic-errors
A large array local to a function is likely to provoke a stack overflow, since local variables are normally allocated on the stack and stack memory is usually very limited.
C-style arrays are bad, mkay? They don't know their own size, they cannot be easily checked for out-of-bounds access (std::vector and std::array have at() bounds-checking member functions), and they cannot be assigned or passed to functions or returned from functions. Use std::vector instead (or maybe std::array when the size is known in advance).
Let's remove the VLAs you're using and use std::vector. Here is what the code looks like with a sample data set of 10 items (but with a check for boundary conditions).
#include <iostream>
#include <chrono>
#include <vector>
using namespace std;
using namespace std::chrono;
using vint = std::vector<int>;
void quick_sort(vint&, int, int);
int partition(vint&, int, int);
int main()
{
int n = 10, i;
vint a = { 7, 43, 2, 1, 6, 34, 987, 23, 0, 6 };
auto start = high_resolution_clock::now();
quick_sort(a, 0, n - 1);
auto stop = high_resolution_clock::now();
duration <double, micro> d = stop - start;
cout << "Time taken = " << d.count() << endl;
return 0;
}
void quick_sort(vint& a, int l, int u)
{
int j;
if (l < u)
{
j = partition(a, l, u);
quick_sort(a, l, j - 1);
quick_sort(a, j + 1, u);
}
}
int partition(vint& a, int l, int u)
{
int v, i, j, temp;
v = a[l];
i = l;
j = u + 1;
do
{
do
i++;
while (a.at(i) < v&&i <= u);
do
j--;
while (v < a[j]);
if (i < j)
{
temp = a[i];
a[i] = a[j];
a[j] = temp;
}
} while (i < j);
a[l] = a[j];
a[j] = v;
return(j);
}
Live Example.
You see that a std::out_of_range error is thrown on the line with the std::vector.at() call.
Bottom line -- your code was flawed to begin with -- whether it was 10, 100, or a million items. You are going out of bounds, thus the behavior is undefined. Usage of std::vector and at() detected the error, something that VLAs will not give you.
Besides the VLA, your quicksort always chooses the first element as the pivot, which performs badly in the worst case. I don't know what your output.txt contains, but if the array is already sorted, it runs in O(n^2), because every partitioning step splits the range into one element and the rest (an even split is the best case). The recursion depth then also grows to n, which can exhaust the stack. I think this is why it does not give any output for big inputs.
So I would suggest a couple of pivot-choosing heuristics that are commonly used.
Choose it randomly
Choose the median of 3 elements - lowest/middle/highest index (a[l] / a[(l+u)/2] / a[u])
Once you choose a pivot, you can simply swap it with a[l], which minimizes your code changes.
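The median-of-three heuristic could be sketched roughly as follows. The names MedianOfThree, PartitionMedian3 and QuickSort are hypothetical, and the partition loop is a simple single-scan (Lomuto-style) variant chosen for brevity, not the two-pointer scheme from the question. Median-of-three makes an already sorted input split evenly; it does not rule out quadratic behavior for every possible input.

```cpp
#include <utility>
#include <vector>

// Returns the index of the median of a[l], a[m] and a[u], m being the middle.
int MedianOfThree(const std::vector<int>& a, int l, int u) {
    int m = l + (u - l) / 2;
    if (a[l] < a[m]) {
        if (a[m] < a[u]) return m;       // a[l] < a[m] < a[u]
        return (a[l] < a[u]) ? u : l;    // a[m] is the largest
    }
    if (a[l] < a[u]) return l;           // a[m] <= a[l] < a[u]
    return (a[m] < a[u]) ? u : m;        // a[l] is the largest
}

// Moves the chosen pivot to a[l], then partitions a[l..u] around it.
int PartitionMedian3(std::vector<int>& a, int l, int u) {
    std::swap(a[l], a[MedianOfThree(a, l, u)]);
    int v = a[l];
    int i = l;                           // a[l+1..i] holds elements < v
    for (int k = l + 1; k <= u; ++k)
        if (a[k] < v) std::swap(a[++i], a[k]);
    std::swap(a[l], a[i]);               // put the pivot in its final place
    return i;
}

void QuickSort(std::vector<int>& a, int l, int u) {
    if (l < u) {
        int j = PartitionMedian3(a, l, u);
        QuickSort(a, l, j - 1);
        QuickSort(a, j + 1, u);
    }
}
```

Every index used in the partition stays inside [l, u], so the out-of-range access from the original partition does not occur here.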
I notice that vector<bool> is much slower than a bool array when running the following code.
int main()
{
int count = 0;
int n = 1500000;
// slower with c++ vector<bool>
/*vector<bool> isPrime;
isPrime.reserve(n);
isPrime.assign(n, true);
*/
// faster with bool array
bool* isPrime = new bool[n];
for (int i = 0; i < n; ++i)
isPrime[i] = true;
for (int i = 2; i< n; ++i) {
if (isPrime[i])
count++;
for (int j =2; i*j < n; ++j )
isPrime[i*j] = false;
}
cout << count << endl;
return 0;
}
Is there some way to make vector<bool> faster? Btw, both std::vector::push_back and std::vector::emplace_back are even slower than std::vector::assign.
std::vector<bool> can have various performance issues (e.g. take a look at https://isocpp.org/blog/2012/11/on-vectorbool).
In general you can:
use std::vector<std::uint8_t> instead of std::vector<bool> (give a try to std::valarray<bool> also).
This requires more memory and is less cache-friendly, but there isn't an overhead (in the form of bit manipulation) to access a single value, so there are situations in which it works better (after all it's just like your array of bool but without the nuisance of memory management)
use std::bitset if you know at compile time how large your boolean array is going to be (or if you can at least establish a reasonable upper bound)
if Boost is an option try boost::dynamic_bitset (the size can be specified at runtime)
But for speed optimizations you have to test...
With your specific example I can confirm a performance difference only when optimizations are turned off (of course this isn't the way to go).
Some tests with g++ v4.8.3 and clang++ v3.4.5 on an Intel Xeon system (-O3 optimization level) give a different picture:
time (ms)
G++ CLANG++
array of bool 3103 3010
vector<bool> 2835 2420 // not bad!
vector<char> 3136 3031 // same as array of bool
bitset 2742 2388 // marginally better
(time elapsed for 100 runs of the code in the answer)
std::vector<bool> doesn't look so bad (source code here).
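One way to run such a comparison yourself is to template the sieve on the flag container, so that only the type changes between runs. This is a sketch under my own assumptions, not the code used for the table above; CountPrimes is a hypothetical name, and unlike the question's loop it only strikes out multiples of primes.

```cpp
#include <cstddef>
#include <vector>

// Counts the primes below n. Flags can be std::vector<bool>,
// std::vector<char>, or any container offering Flags(n, value) and operator[].
template <typename Flags>
std::size_t CountPrimes(std::size_t n) {
    Flags isPrime(n, true);
    std::size_t count = 0;
    for (std::size_t i = 2; i < n; ++i) {
        if (!isPrime[i]) continue;
        ++count;
        for (std::size_t j = 2 * i; j < n; j += i)  // strike out multiples of i
            isPrime[j] = false;
    }
    return count;
}
```

CountPrimes<std::vector<bool> >(1500000) and CountPrimes<std::vector<char> >(1500000) can then be timed side by side with optimizations enabled.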
vector<bool> may have a template specialization and may be implemented using bit array to save space. Extracting and saving a bit and converting it from / to bool may cause the performance drop you are observing. If you use std::vector::push_back, you are resizing the vector which will cause even worse performance. Next performance killer may be assign (Worst complexity: Linear of first argument), instead use operator [] (Complexity: constant).
On the other hand, bool [] is guaranteed to be array of bool.
And you should resize to n instead of n-1 to avoid undefined behaviour.
vector<bool> can be high performance, but isn't required to be. For vector<bool> to be efficient, it needs to operate on many bools at a time (e.g. isPrime.assign(n, true)), and the implementor has had to put loving care into it. Indexing individual bools in a vector<bool> is slow.
Here is a prime finder that I wrote a while back using vector<bool> and clang + libc++ (the libc++ part is important):
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
std::vector<bool>
init_primes()
{
std::vector<bool> primes(0x80000000, true);
primes[0] = false;
primes[1] = false;
const auto pb = primes.begin();
const auto pe = primes.end();
const auto sz = primes.size();
size_t i = 2;
while (true)
{
size_t j = i*i;
if (j >= sz)
break;
do
{
primes[j] = false;
j += i;
} while (j < sz);
i = std::find(pb + (i+1), pe, true) - pb;
}
return primes;
}
int
main()
{
using namespace std::chrono;
using dsec = duration<double>;
auto t0 = steady_clock::now();
auto p = init_primes();
auto t1 = steady_clock::now();
std::cout << dsec(t1-t0).count() << "\n";
}
This executes for me in about 28s (-O3). When I change it to return a vector<char> instead, the execution time goes up to about 44s.
If you run this using some other std::lib, you probably won't see this trend. On libc++ algorithms such as std::find have been optimized to search a word of bits at a time, instead of bit at a time.
See http://howardhinnant.github.io/onvectorbool.html for more details on what std algorithms could be optimized by your vendor.
I need a blazing fast way to find the 2D positions and values of the M largest elements in an NxN array.
right now I'm doing this:
struct SourcePoint {
Point point;
float value;
};
SourcePoint* maxValues = new SourcePoint[ M ];
maxCoefficients = new SourcePoint*[
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
if (sample > maxValues[0].value) {
int q = 1;
while ( q < M && sample > maxValues[q].value ) { // test q first so maxValues[M] is never read
maxValues[q-1] = maxValues[q]; // shuffle the values back
q++;
}
maxValues[q-1].value = sample;
maxValues[q-1].point = Point(i,j);
}
}
}
A Point struct is just two ints - x and y.
This code basically does an insertion sort of the values coming in. maxValues[0] always contains the SourcePoint with the lowest value that still keeps it within the top M values encountered so far. This gives us a quick and easy bailout: if sample <= maxValues[0].value, we don't do anything. The issue I'm having is the shuffling every time a new better value is found. It works its way all the way down maxValues until it finds its spot, shuffling all the elements in maxValues to make room for itself.
I'm getting to the point where I'm ready to look into SIMD solutions, or cache optimisations, since it looks like there's a fair bit of cache thrashing happening. Cutting the cost of this operation down will dramatically affect the performance of my overall algorithm since this is called many many times and accounts for 60-80% of my overall cost.
I've tried using a std::vector and make_heap, but I think the overhead for creating the heap outweighed the savings of the heap operations. This is likely because M and N generally aren't large. M is typically 10-20 and N 10-30 (NxN 100 - 900). The issue is this operation is called repeatedly, and it can't be precomputed.
I just had a thought to pre-load the first M elements of maxValues which may provide some small savings. In the current algorithm, the first M elements are guaranteed to shuffle themselves all the way down just to initially fill maxValues.
Any help from optimization gurus would be much appreciated :)
A few ideas you can try. In some quick tests with N=100 and M=15 I was able to get it around 25% faster in VC++ 2010 but test it yourself to see whether any of them help in your case. Some of these changes may have no or even a negative effect depending on the actual usage/data and compiler optimizations.
Don't allocate a new maxValues array each time unless you need to. Using a stack variable instead of dynamic allocation gets me +5%.
Changing g_Source[i][j] to g_Source[j][i] gains you a very little bit (not as much as I'd thought there would be).
Using the structure SourcePoint1 listed at the bottom gets me another few percent.
The biggest gain of around +15% was to replace the local variable sample with g_Source[j][i]. The compiler is likely smart enough to optimize out the multiple reads to the array which it can't do if you use a local variable.
Trying a simple binary search netted me a small loss of a few percent. For larger M/Ns you'd likely see a benefit.
If possible try to keep the source data in arr[][] sorted, even if only partially. Ideally you'd want to generate maxValues[] at the same time the source data is created.
Looking at how the data is created/stored/organized may reveal patterns or information that reduce the time needed to generate your maxValues[] array. For example, in the best case you could come up with a formula that gives you the top M coordinates without needing to iterate and sort.
Code for above:
struct SourcePoint1 {
int x;
int y;
float value;
int test; //Play with manual/compiler padding if needed
};
If you want to go into micro-optimizations at this point, a simple first step should be to get rid of the Points and just stuff both dimensions into a single int. That reduces the amount of data you need to shift around, and gets SourcePoint down to being a power of two long, which simplifies indexing into it.
Also, are you sure that keeping the list sorted is better than simply recomputing which element is the new lowest after each time you shift the old lowest out?
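Both suggestions could be combined roughly like this. It is only a sketch: PackedSource, PackPos and UnsortedTopM are hypothetical names, the packing y * cols + x is one possible encoding, and the 8-byte size assumes the usual 4-byte int and float. Whether rescanning M entries beats shifting them is something to measure.

```cpp
#include <cstddef>

// Position packed into one int, so an entry is two 4-byte fields on
// typical platforms.
struct PackedSource {
    int pos;      // y * cols + x
    float value;
};

inline int PackPos(int x, int y, int cols) { return y * cols + x; }
inline int UnpackX(int pos, int cols) { return pos % cols; }
inline int UnpackY(int pos, int cols) { return pos / cols; }

// Keeps the M largest samples (M > 0) in no particular order. Instead of
// shifting entries to stay sorted, it overwrites the current minimum and
// rescans the M entries to find the new minimum.
template <std::size_t M>
class UnsortedTopM {
public:
    UnsortedTopM() : count_(0), minIndex_(0) {}

    void Add(float value, int pos) {
        if (count_ < M) {                              // still filling up
            items_[count_].value = value;
            items_[count_].pos = pos;
            ++count_;
            if (count_ == M) Rescan();
            return;
        }
        if (value <= items_[minIndex_].value) return;  // quick bailout
        items_[minIndex_].value = value;               // evict the old minimum
        items_[minIndex_].pos = pos;
        Rescan();
    }

    std::size_t size() const { return count_; }
    const PackedSource& operator[](std::size_t i) const { return items_[i]; }

private:
    void Rescan() {
        minIndex_ = 0;
        for (std::size_t i = 1; i < count_; ++i)
            if (items_[i].value < items_[minIndex_].value) minIndex_ = i;
    }

    PackedSource items_[M];
    std::size_t count_;
    std::size_t minIndex_;
};
```

If the results are needed in order, the M entries can be sorted once after the scan of the array has finished.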
(Updated 22:37 UTC 2011-08-20)
I propose a binary min-heap of fixed size holding the M largest elements (but still in min-heap order!). It probably won't be faster in practice, as I think the OP's insertion sort probably has decent real world performance (at least when the recommendations of the other posters in this thread are taken into account).
Look-up in the case of failure should be constant time: If the current element is less than the minimum element of the heap (containing the max M elements) we can reject it outright.
If it turns out that we have an element bigger than the current minimum of the heap (the Mth biggest element) we extract (discard) the previous min and insert the new element.
If the elements are needed in sorted order the heap can be sorted afterwards.
First attempt at a minimal C++ implementation:
template<unsigned size, typename T>
class m_heap {
private:
T nodes[size];
static const unsigned last = size - 1;
static unsigned parent(unsigned i) { return (i - 1) / 2; }
static unsigned left(unsigned i) { return i * 2 + 1; }  // children of node i in a 0-based heap
static unsigned right(unsigned i) { return i * 2 + 2; }
void bubble_down(unsigned int i) {
for (;;) {
unsigned j = i;
if (left(i) < size && nodes[left(i)] < nodes[i])
j = left(i);
if (right(i) < size && nodes[right(i)] < nodes[j])
j = right(i);
if (i != j) {
swap(nodes[i], nodes[j]);
i = j;
} else {
break;
}
}
}
void bubble_up(unsigned i) {
while (i > 0 && nodes[i] < nodes[parent(i)]) {
swap(nodes[parent(i)], nodes[i]);
i = parent(i);
}
}
public:
m_heap() {
for (unsigned i = 0; i < size; i++) {
nodes[i] = numeric_limits<T>::lowest(); // C++11; min() is the smallest positive value for float
}
}
void add(const T& x) {
if (x < nodes[0]) {
// reject outright
return;
}
// overwrite the current minimum (the root), then restore the heap order
nodes[0] = x;
bubble_down(0);
}
};
Small test/usage case:
#include <iostream>
#include <limits>
#include <algorithm>
#include <vector>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
using namespace std;
// INCLUDE TEMPLATED CLASS FROM ABOVE
typedef vector<float> vf;
bool compare(float a, float b) { return a > b; }
int main()
{
int N = 2000;
vf v;
for (int i = 0; i < N; i++) v.push_back( rand()*1e6 / RAND_MAX);
static const int M = 50;
m_heap<M, float> h;
for (int i = 0; i < N; i++) h.add( v[i] );
sort(v.begin(), v.end(), compare);
vf heap(h.get(), h.get() + M); // assume public in m_heap: T* get() { return nodes; }
sort(heap.begin(), heap.end(), compare);
cout << "Real\tFake" << endl;
for (int i = 0; i < M; i++) {
cout << v[i] << "\t" << heap[i] << endl;
if (fabs(v[i] - heap[i]) > 1e-5) abort();
}
}
You're looking for a priority queue:
template < class T, class Container = vector<T>,
class Compare = less<typename Container::value_type> >
class priority_queue;
You'll need to figure out the best underlying container to use, and probably define a Compare function to deal with your Point type.
If you want to optimize it, you could run a queue on each row of your matrix in its own worker thread, then run an algorithm to pick the largest item of the queue fronts until you have your M elements.
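A minimal single-threaded sketch of the priority-queue idea is given below. It works on plain float values for brevity; carrying the position along would need a small struct plus a matching Compare, as noted above. LargestM is a hypothetical name.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Returns the m largest values in ascending order.
std::vector<float> LargestM(const std::vector<float>& values, std::size_t m) {
    // std::greater turns priority_queue into a min-heap: top() is the
    // smallest of the values kept so far.
    std::priority_queue<float, std::vector<float>, std::greater<float> > heap;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (heap.size() < m) {
            heap.push(values[i]);
        } else if (m > 0 && values[i] > heap.top()) {
            heap.pop();                  // drop the smallest kept value
            heap.push(values[i]);
        }
    }
    std::vector<float> result;
    while (!heap.empty()) {              // popping a min-heap yields ascending order
        result.push_back(heap.top());
        heap.pop();
    }
    return result;
}
```

The heap never holds more than m elements, so each candidate costs at most O(log m) once the heap is full.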
A quick optimization would be to add a sentinel value to your maxValues array. If you have maxValues[M].value equal to std::numeric_limits<float>::max() then you can eliminate the q < M test in your while loop condition.
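The sentinel idea might look like this in code. This is a sketch with hypothetical names (Entry, InitTopM, AddSample); it uses +infinity rather than max() as the sentinel, which is my own substitution so that even an infinite sample cannot run past the end of the array.

```cpp
#include <limits>

struct Entry {   // stand-in for the question's SourcePoint
    int x, y;
    float value;
};

// maxValues must have M + 1 entries (M >= 1): indices 0..M-1 hold the
// current top M in ascending order, index M holds the sentinel.
void InitTopM(Entry* maxValues, int M) {
    for (int q = 0; q < M; ++q) {
        maxValues[q].x = maxValues[q].y = -1;
        maxValues[q].value = -std::numeric_limits<float>::infinity();
    }
    maxValues[M].x = maxValues[M].y = -1;
    maxValues[M].value = std::numeric_limits<float>::infinity();  // sentinel
}

void AddSample(Entry* maxValues, float sample, int x, int y) {
    if (sample > maxValues[0].value) {
        int q = 1;
        // No q < M test: `sample > +infinity` is never true, so the loop
        // stops at the sentinel at the latest.
        while (sample > maxValues[q].value) {
            maxValues[q - 1] = maxValues[q];  // shuffle the values back
            ++q;
        }
        maxValues[q - 1].value = sample;
        maxValues[q - 1].x = x;
        maxValues[q - 1].y = y;
    }
}
```

The array is one element longer than before, which is the price for removing one comparison from the inner loop.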
One idea would be to use the std::partial_sort algorithm on a plain one-dimensional sequence of references into your NxN array. You could probably also cache this sequence of references for subsequent calls. I don't know how well it performs, but it's worth a try - if it works well enough, you don't have as much "magic". In particular, you don't resort to micro optimizations.
Consider this showcase:
#include <algorithm>
#include <iostream>
#include <vector>
#include <stddef.h>
#include <string.h> // memset
static const int M = 15;
static const int N = 20;
// Represents a reference to a sample of some two-dimensional array
class Sample
{
public:
Sample( float *arr, size_t row, size_t col )
: m_arr( arr ),
m_row( row ),
m_col( col )
{
}
inline operator float() const {
return m_arr[m_row * N + m_col];
}
bool operator<( const Sample &rhs ) const {
return (float)rhs < (float)*this; // reversed on purpose: larger samples sort first
}
int row() const {
return m_row;
}
int col() const {
return m_col;
}
private:
float *m_arr;
size_t m_row;
size_t m_col;
};
int main()
{
// Setup a demo array
float arr[N][N];
memset( arr, 0, sizeof( arr ) );
// Put in some sample values
arr[2][1] = 5.0;
arr[9][11] = 2.0;
arr[5][4] = 4.0;
arr[15][7] = 3.0;
arr[12][19] = 1.0;
// Setup the sequence of references into this array; you could keep
// a copy of this sequence around to reuse it later, I think.
std::vector<Sample> samples;
samples.reserve( N * N );
for ( size_t row = 0; row < N; ++row ) {
for ( size_t col = 0; col < N; ++col ) {
samples.push_back( Sample( (float *)arr, row, col ) );
}
}
// Let partial_sort find the M largest entries
std::partial_sort( samples.begin(), samples.begin() + M, samples.end() );
// Print out the row/column of the M largest entries.
for ( std::vector<Sample>::size_type i = 0; i < M; ++i ) {
std::cout << "#" << (i + 1) << " is " << (float)samples[i] << " at " << samples[i].row() << "/" << samples[i].col() << std::endl;
}
}
First of all, you are marching through the array in the wrong order!
You always, always, always want to scan through memory linearly. That means the last index of your array needs to be changing fastest. So instead of this:
for (int j = 0; j < rows; j++) {
for (int i = 0; i < cols; i++) {
float sample = arr[i][j];
Try this:
for (int i = 0; i < cols; i++) {
for (int j = 0; j < rows; j++) {
float sample = arr[i][j];
I predict this will make a bigger difference than any other single change.
Next, I would use a heap instead of a sorted array. The standard <algorithm> header already has push_heap and pop_heap functions to use a vector as a heap. (This will probably not help all that much, though, unless M is fairly large. For small M and a randomized array, you do not wind up doing all that many insertions on average... Something like O(log N) I believe.)
Next after that is to use SSE2. But that is peanuts compared to marching through memory in the right order.
You should be able to get nearly linear speedup with parallel processing.
With N CPUs, you can process a band of rows/N rows (and all columns) with each CPU, finding the top M entries in each band. And then do a selection sort to find the overall top M.
You could probably do that with SIMD as well (but here you'd divide up the task by interleaving columns instead of banding the rows). Don't try to make SIMD do your insertion sort faster, make it do more insertion sorts at once, which you combine at the end using a single very fast step.
Naturally you could do both multi-threading and SIMD, but on a problem which is only 30x30, that's not likely to be worthwhile.
I tried replacing float by double, and interestingly that gave me a speed improvement of about 20% (using VC++ 2008). That's a bit counterintuitive, but it seems modern processors or compilers are optimized for double value processing.
Use a linked list to store the best yet M values. You'll still have to iterate over it to find the right spot, but the insertion is O(1). It would probably even be better than binary search and insertion O(N)+O(1) vs O(lg(n))+O(N).
Interchange the fors, so you're not accessing every Nth element in memory and thrashing the cache.
LE: Throwing another idea that might work for uniformly distributed values.
Find the min, max in 3/2*O(N^2) comparisons.
Create anywhere from N to N^2 uniformly distributed buckets, preferably closer to N^2 than N.
For every element in the NxN matrix place it in bucket[(int)((value-min)/range*(B-1))], where range=max-min and B is the number of buckets.
Finally create a set starting from the highest bucket to the lowest, add elements from other buckets to it while |current set| + |next bucket| <=M.
If you get M elements you're done.
You'll likely get fewer elements than M, let's say P.
Apply your algorithm for the remaining bucket and get biggest M-P elements out of it.
If elements are uniform and you use N^2 buckets its complexity is about 3.5*(N^2) vs your current solution which is about O(N^2)*ln(M).
In the following example a std::map structure is filled with 26 values from a - z (for key) and 0 - 25 for value. The time taken (on my system) to lookup the last entry (10000000 times) is roughly 250 ms for the vector, and 125 ms for the map. (I compiled using release mode, with O3 option turned on for g++ 4.4)
But if for some odd reason I wanted better performance than the std::map, what data structures and functions would I need to consider using?
I apologize if the answer seems obvious to you, but I haven't had much experience in the performance critical aspects of C++ programming.
#include <ctime>
#include <map>
#include <vector>
#include <iostream>
struct mystruct
{
char key;
int value;
mystruct(char k = 0, int v = 0) : key(k), value(v) { }
};
int find(const std::vector<mystruct>& ref, char key)
{
for (std::vector<mystruct>::const_iterator i = ref.begin(); i != ref.end(); ++i)
if (i->key == key) return i->value;
return -1;
}
int main()
{
std::map<char, int> mymap;
std::vector<mystruct> myvec;
for (int i = 'a'; i < 'a' + 26; ++i)
{
mymap[i] = i - 'a';
myvec.push_back(mystruct(i, i - 'a'));
}
int pre = clock();
for (int i = 0; i < 10000000; ++i)
{
find(myvec, 'z');
}
std::cout << "linear scan: milli " << (clock() - pre) * 1000.0 / CLOCKS_PER_SEC << "\n"; // clock ticks -> ms
pre = clock();
for (int i = 0; i < 10000000; ++i)
{
mymap['z'];
}
std::cout << "map scan: milli " << (clock() - pre) * 1000.0 / CLOCKS_PER_SEC << "\n"; // clock ticks -> ms
return 0;
}
For your example, use int value(char x) { return x - 'a'; }
More generalized, since the "keys" are continuous and dense, use an array (or vector) to guarantee Θ(1) access time.
If you don't need the keys to be sorted, use unordered_map, which should improve most operations from O(log n) to O(1) on average.
(Sometimes, esp. for small data sets, linear search is faster than hash table (unordered_map) / balanced binary trees (map) because the former has a much simpler algorithm, thus reducing the hidden constant in big-O. Profile, profile, profile.)
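For the unordered_map variant, a lookup could be sketched as below (C++11). Lookup is a hypothetical helper that mirrors the question's find() by returning -1 for a missing key.

```cpp
#include <unordered_map>

// Hash-table lookup, O(1) on average. Returns -1 when the key is absent,
// like the question's find().
int Lookup(const std::unordered_map<char, int>& table, char key) {
    std::unordered_map<char, int>::const_iterator it = table.find(key);
    return it == table.end() ? -1 : it->second;
}
```

With only 26 keys this is not guaranteed to beat the linear scan, which is exactly why profiling is recommended above.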
For starters, you should probably use std::map::find if you want to compare the search times; operator[] has additional functionality over and above the regular find.
Also, your data set is pretty small, which means that the whole vector will easily fit into the processor cache; a lot of modern processors are optimised for this sort of brute-force search so you'd end up getting fairly good performance. The map, while theoretically having better performance (O(log n) rather than O(n)) can't really exploit its advantage of the smaller number of comparisons because there aren't that many keys to compare against and the overhead of its data layout works against it.
TBH for data structures this small, the additional performance gain from not using a vector is often negligible. The "smarter" data structures like std::map come into play when you're dealing with larger amounts of data and a well distributed set of data that you are searching for.
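To illustrate the difference between std::map::find and operator[] mentioned above, here is a small sketch; FindValue is a hypothetical name.

```cpp
#include <map>

// Read-only lookup: std::map::find never inserts, so it also works on a
// const map. operator[] would insert a value-initialized entry for a
// missing key, which is extra work a pure lookup does not need.
int FindValue(const std::map<char, int>& m, char key) {
    std::map<char, int>::const_iterator it = m.find(key);
    return it != m.end() ? it->second : -1;
}
```

In the benchmark, FindValue(mymap, 'z') would take the place of mymap['z'] inside the timed loop.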
If you really just have values for all entries from A to Z, why don't you use letter (properly adjusted) as the index into a vector?:
std::vector<int> direct_map;
direct_map.resize(26);
for (int i = 'a'; i < 'a' + 26; ++i)
{
direct_map[i - 'a']= i - 'a';
}
// ...
int find(const std::vector<int> &direct_map, char key)
{
int index= key - 'a';
if (index>=0 && index<direct_map.size())
return direct_map[index];
return -1;
}
I'm trying to optimize my C++ code. I've searched the internet on using dynamically allocated C++ arrays vs using std::vector and have generally seen a recommendation in favor of std::vector and that the difference in performance between the two is negligible. For instance here - Using arrays or std::vectors in C++, what's the performance gap?.
However, I wrote some code to test the performance of iterating through an array/vector and assigning values to the elements and I generally found that using dynamically allocated arrays was nearly 3 times faster than using vectors (I did specify a size for the vectors beforehand). I used g++-4.3.2.
However I feel that my test may have ignored issues I don't know about so I would appreciate any advice on this issue.
Thanks
Code used -
#include <time.h>
#include <iostream>
#include <vector>
using namespace std;
int main() {
clock_t start,end;
std::vector<int> vec(9999999);
std::vector<int>::iterator vecIt = vec.begin();
std::vector<int>::iterator vecEnd = vec.end();
start = clock();
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = i;
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i;
}
end = clock();
cout<<"array: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
}
When benchmarking C++ containers, it's important to enable most compiler optimisations. Several of my own answers on SO have fallen foul of this - for example, the function call overhead when something like operator[] is not inlined can be very significant.
Just for fun, try iterating over the plain array using a pointer instead of an integer index (the code should look just like the vector iteration, since the point of STL iterators is to appear like pointer arithmetic for most operations). I bet the speed will be exactly equal in that case. Which of course means you should pick the vector, since it will save you a world of headaches from managing arrays by hand.
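One way to set up that experiment is to write the loop once against an iterator pair, so the array and the vector run literally the same code. A minimal sketch; FillCounting is a hypothetical name.

```cpp
#include <vector>

// Writes 0, 1, 2, ... through any iterator pair. A raw pointer pair and a
// std::vector<int>::iterator pair are both valid arguments.
template <typename Iter>
void FillCounting(Iter first, Iter last) {
    for (int i = 0; first != last; ++first, ++i)
        *first = i;
}
```

FillCounting(arr, arr + 9999999) then covers the array case and FillCounting(vec.begin(), vec.end()) the vector case.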
The thing about the standard library classes such as std::vector is that yes, naively, it is a lot more code than a raw array. But all of it can be trivially inlined by the compiler, which means that if optimizations are enabled, it becomes essentially the same code as if you'd used a raw array. The speed difference then is not negligible but non-existent. All the overhead is removed at compile-time.
But that requires compiler optimizations to be enabled.
I imagine the reason why you found iterating and adding to std::vector 3 times slower than a plain array is a combination of the cost of iterating the vector and doing the assignment.
Edit:
That was my initial assumption before the testcase; however running the testcase (compiled with -O3) shows the converse - std::vector is actually 3 times faster, which surprised me.
I can't see how std::vector could be faster (certainly not 3 times faster) than a vanilla array copy - I think there's some optimisation being applied to the std::vector compiled code which isn't happening for the array version.
Original benchmark results:
$ ./array
array: 0.059375
vector: 0.021209
std::vector is 3x faster. Same benchmark again, except add an additional outer loop to run the test iterator loop 1000 times:
$ ./array
array: 21.7129
vector: 21.6413
std::vector is now ~ the same speed as array.
Edit 2
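Found it! So the problem with your test case is that in the vector case the memory holding the data appears to be already warm, most likely because of the way it is initialised: std::vector<int> vec(9999999) value-initialises every element, so the memory has been written once before the timing starts, whereas the block returned by new int[9999999] is first touched inside the timed loop. If I 'warm' up the memory before each timing test, I get the same numbers for array and vector: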
#include <time.h>
#include <iostream>
#include <vector>
int main() {
clock_t start,end;
std::vector<int> vec(9999999);
std::vector<int>::iterator vecIt = vec.begin();
std::vector<int>::iterator vecEnd = vec.end();
// get vec into CPU cache.
for (int i = 0; vecIt != vecEnd; i++) { *(vecIt++) = i; }
vecIt = vec.begin();
start = clock();
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = i;
}
end = clock();
std::cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
int* arr = new int[9999999];
// get arr into CPU cache.
for (int i = 0; i < 9999999; i++) { arr[i] = i; }
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i;
}
end = clock();
std::cout<<"array: "<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
}
This gives me the following result:
$ ./array
vector: 0.020875
array: 0.020695
I agree with rmeador,
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = i; // <-- quick offset calculation
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i; // <-- not fair play :) - offset = arr + i*sizeof(int)
}
I think the answer here is obvious: it doesn't matter. Like jalf said the code will end up being about the same, but even if it wasn't, look at the numbers. The code you posted creates a huge array of 10 MILLION items, yet iterating over the entire array takes only a few hundredths of a second.
Even if your application really is working with that much data, whatever it is you're actually doing with that data is likely to take much more time than iterating over your array. Just use whichever data structure you prefer, and focus your time on the rest of your code.
To prove my point, here's the code with one change: the assignment of i to the array item is replaced with an assignment of sqrt(i). On my machine using -O2, the execution time triples from .02 to .06 seconds.
#include <time.h>
#include <iostream>
#include <vector>
#include <math.h>
using namespace std;
int main() {
clock_t start,end;
std::vector<int> vec(9999999);
std::vector<int>::iterator vecIt = vec.begin();
std::vector<int>::iterator vecEnd = vec.end();
start = clock();
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = sqrt(i);
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i;
}
end = clock();
cout<<"array: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
}
The issue seems to be that you compiled your code with optimizations turned off. On my machine, OS X 10.5.7 with g++ 4.0.1 I actually see that the vector is faster than primitive arrays by a factor of 2.5.
With gcc try to pass -O2 to the compiler and see if there's any improvement.
The reason that your array iteration is faster is that the number of iterations is constant, and the compiler is able to unroll the loop. Try to use rand to generate a number and multiply it up to the size you want, so that the compiler won't be able to figure it out at compile time. Then try it again; you will see similar runtime results.
One reason your code might not be performing quite the same is that in your std::vector version you are incrementing two values, the integer i and the std::vector::iterator vecIt. To really be equivalent, you could refactor to
start = clock();
for (int i = 0; i < vec.size(); i++) {
vec[i] = i;
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
Your code provides an unfair comparison between the two cases since you're doing far more work in the vector test than the array test.
With the vector, you're incrementing both the iterator (vecIt) and a separate variable (i) for generating the assignment values.
With the array, you're only incrementing the variable i and using it for dual purpose.