C++: Optimizing speed with vector/array? - c++

I have a nested for-loop structure and right now I am re-declaring the vector at the start of each iteration:
void function (n1,n2,bound,etc){
for (int i=0; i<bound; i++){
vector< vector<long long> > vec(n1, vector<long long>(n2));
//about three more for-loops here
}
}
This allows me to "start fresh" each iteration, which works great because my internal operations are largely in the form of vec[a][b] += some value. But I worry that it's slow for large n1 or large n2. I don't know the underlying architecture of vectors/arrays/etc so I am not sure what the fastest way is to handle this situation. Should I use an array instead? Should I clear it differently? Should I handle the logic differently altogether?
EDIT: The vector's size technically does not change each iteration (but it may change based on function parameters). I'm simply trying to clear it/etc so the program is as fast as humanly possible given all other circumstances.
EDIT:
My results of different methods:
Timings (for a sample set of data):
reclaring vector method: 111623 ms
clearing/resizing method: 126451 ms
looping/setting to 0 method: 88686 ms

I have a clear preference for small scopes (i.e. declaring the variable in the innermost loop if it’s only used there) but for large sizes this could cause a lot of allocations.
So if this loop is a performance problem, try declaring the variable outside the loop and merely clearing it inside the loop – however, this is only advantageous if the (reserved) size of the vector stays identical. If you are resizing the vector, then you get reallocations anyway.
Don’t use a raw array – it doesn’t give you any advantage, and only trouble.

Here is some code that tests a few different methods.
#include <chrono>
#include <iostream>
#include <vector>
int main()
{
typedef std::chrono::high_resolution_clock clock;
unsigned n1 = 1000;
unsigned n2 = 1000;
// Original method
{
auto start = clock::now();
for (unsigned i = 0; i < 10000; ++i)
{
std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
// vec is initialized to zero already
// do stuff
}
auto elapsed_time = clock::now() - start;
std::cout << elapsed_time.count() << std::endl;
}
// reinitialize values to zero at every pass in the loop
{
auto start = clock::now();
std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
for (unsigned i = 0; i < 10000; ++i)
{
// initialize vec to zero at the start of every loop
for (unsigned j = 0; j < n1; ++j)
for (unsigned k = 0; k < n2; ++k)
vec[j][k] = 0;
// do stuff
}
auto elapsed_time = clock::now() - start;
std::cout << elapsed_time.count() << std::endl;
}
// clearing the vector this way is not optimal since it will destruct the
// inner vectors
{
auto start = clock::now();
std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
for (unsigned i = 0; i < 10000; ++i)
{
vec.clear();
vec.resize(n1, std::vector<long long>(n2));
// do stuff
}
auto elapsed_time = clock::now() - start;
std::cout << elapsed_time.count() << std::endl;
}
// equivalent to the second method from above
// no performace penalty
{
auto start = clock::now();
std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
for (unsigned i = 0; i < 10000; ++i)
{
for (unsigned j = 0; j < n1; ++j)
{
vec[j].clear();
vec[j].resize(n2);
}
// do stuff
}
auto elapsed_time = clock::now() - start;
std::cout << elapsed_time.count() << std::endl;
}
}
Edit: I've updated the code to make a fairer comparison between the methods.
Edit 2: Cleaned up the code a bit, methods 2 or 4 are the way to go.
Here are the timings of the above four methods on my computer:
16327389
15216024
16371469
15279471
The point is that you should try out different methods and profile your code.

When choosing a container i usually use this diagram to help me:
source
Other than that,
Like previously posted if this is causing performance problems declare the container outside of the for loop and just clear it at the start of each iteration

In addition to the previous comments :
if you use Robinson's swap method, you could go ever faster by handling that swap asynchronously.

Why not something like that :
{
vector< vector<long long> > vec(n1, vector<long long>(n2));
for (int i=0; i<bound; i++){
//about three more for-loops here
vec.clear();
}
}
Edit: added scope braces ;-)

Well if you are really concerned about performance (and you know the size of n1 and n2 beforehand) but don't want to use a C-style array, std::array may be your friend.
EDIT: Given your edit, it seems an std::array isn't an appropriate substitute since while the vector size does not change each iteration, it still isn't known before compilation.

Since you have to reset the vector values to 0 each iteration, in practical terms, this question boils down to "is the cost of allocating and deallocating the memory for the vector cheap or expensive compared to the computations inside the loops".
Assuming the computations are the expensive part of the algorithm, the way you've coded it is both clear, concise, shows the intended scope, and is probably just as fast as alternate approaches.
If however your computations and updates are extremely fast and the allocation/deallocation of the vector is relatively expensive, you could use std::fill to fill zeroes back into the array at the end/beginning of each iteration through the loop.
Of course the only way to know for sure is to measure with a profiler. I suspect you'll find that the approach you took won't show up as a hotspot of any sort and you should leave the obvious code in place.

The overhead of using a vector vs an array is minor, especially when you are getting a lot of useful functionality from the vector. Internally a vector allocates an array. So vector is the way to go.

Related

Cache-friendliness std::list vs std::vector

With CPU caches becoming better and better std::vector usually outperforms std::list even when it comes to testing the strengths of a std::list. For this reason, even for situations where I need to delete/insert in the middle of the container I usually pick std::vector but I realized I had never tested this to make sure assumptions were correct. So I set up some test code:
#include <iostream>
#include <chrono>
#include <list>
#include <vector>
#include <random>
void TraversedDeletion()
{
std::random_device dv;
std::mt19937 mt{ dv() };
std::uniform_int_distribution<> dis(0, 100000000);
std::vector<int> vec;
for (int i = 0; i < 100000; ++i)
{
vec.emplace_back(dis(mt));
}
std::list<int> lis;
for (int i = 0; i < 100000; ++i)
{
lis.emplace_back(dis(mt));
}
{
std::cout << "Traversed deletion...\n";
std::cout << "Starting vector measurement...\n";
auto now = std::chrono::system_clock::now();
auto index = vec.size() / 2;
auto itr = vec.begin() + index;
for (int i = 0; i < 10000; ++i)
{
itr = vec.erase(itr);
}
std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
}
{
std::cout << "Starting list measurement...\n";
auto now = std::chrono::system_clock::now();
auto index = lis.size() / 2;
auto itr = lis.begin();
std::advance(itr, index);
for (int i = 0; i < 10000; ++i)
{
auto it = itr;
std::advance(itr, 1);
lis.erase(it);
}
std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
}
}
void RandomAccessDeletion()
{
std::random_device dv;
std::mt19937 mt{ dv() };
std::uniform_int_distribution<> dis(0, 100000000);
std::vector<int> vec;
for (int i = 0; i < 100000; ++i)
{
vec.emplace_back(dis(mt));
}
std::list<int> lis;
for (int i = 0; i < 100000; ++i)
{
lis.emplace_back(dis(mt));
}
std::cout << "Random access deletion...\n";
std::cout << "Starting vector measurement...\n";
std::uniform_int_distribution<> vect_dist(0, vec.size() - 10000);
auto now = std::chrono::system_clock::now();
for (int i = 0; i < 10000; ++i)
{
auto rand_index = vect_dist(mt);
auto itr = vec.begin();
std::advance(itr, rand_index);
vec.erase(itr);
}
std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
std::cout << "Starting list measurement...\n";
now = std::chrono::system_clock::now();
for (int i = 0; i < 10000; ++i)
{
auto rand_index = vect_dist(mt);
auto itr = lis.begin();
std::advance(itr, rand_index);
lis.erase(itr);
}
std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
}
int main()
{
RandomAccessDeletion();
TraversedDeletion();
std::cin.get();
}
All results are compiled with /02 (Maximize speed).
The first, RandomAccessDeletion(), generates a random index and erases this index 10.000 times. My assumptions were right and the vector is indeed a lot faster than the list:
Random access deletion...
Starting vector measurement...
Took 240299 μs
Starting list measurement...
Took 1368205 μs
The vector is about 5.6x faster than the list. We can most likely thank our cache overlords for this performance benefit, even though we need to shift the elements in the vector on every deletion it's impact is less than the lookup time of the list as we can see in the benchmark.
So then I added another test, seen in the TraversedDeletion(). It doesn't use randomized positions to delete but rather it picks an index in the middle of the container and uses that as base iterator, then traverse the container to erase 10.000 times.
My assumptions were that the list would outperform the vector only slightly or as fast as the vector.
The results for the same execution:
Traversed deletion...
Starting vector measurement....
Took 195477 μs
Starting list measurement...
Took 581 μs
Wow. The list is about 336x faster. This is really far off from my expectations. So having a few cache misses in the list doesn't seem to matter at all here as cutting the lookup time for the list weighs in way more.
So the list apparently still has a really strong position when it comes to performance for corner/unusual cases, or are my test cases flawed in some way?
Does this mean that the list nowadays is only a reasonable option for lots of insertions/deletions in the middle of a container when traversing or are there other cases?
Is there a way I could change the vector access & erasure in TraversedDeletion() to make it at least a bit more competition vs the list?
In response to #BoPersson's comment:
vec.erase(it, it+10000) would perform a lot better than doing 10000
separate deletes.
Changing:
for (int i = 0; i < 10000; ++i)
{
itr = vec.erase(itr);
}
To:
vec.erase(itr, itr + 10000);
Gave me:
Starting vector measurement...
Took 19 μs
This is a major improvement already.
In TraversedDeletion you are essentially doing a pop_front but instead of being at the front you are doing it in the middle. For a linked list this is not an issue. Deleting the node is a O(1) operation. Unfortunately when you do this in the vector is it a O(N) operation where N is vec.end() - itr. This is because it has to copy every element from deletion point forward one element. That is why it is so much more expensive in the vector case.
On the other hand in RandomAccessDeletion you are constantly changing the delete point. This means you have an O(N) operation to traverse the list to get to the node to delete and a O(1) to delete the node versus a O(1) traversersal to find the element and a O(N) operation to copy the elements in the vector forward. The reason this is not the same though is the cost to traverse from node to node has a higher constant than it takes to copy the elements in the vector.
The long duration for list in RandomDeletion is due to the time it takes to advance from the beginning of the list to the randomly selected element, an O(N) operation.
TraverseDeletion just increments an iterator, an O(1) operation.
The "fast" part about a vector is "reaching" the element which needs to be accessed (traversing). You don't actually traverse much on the vector in the deletion but only access the first element. ( I would say the adavance-by-one does not make much measurement wise)
The deletion then takes quite a lot of time ( O(n) so when deleting each one by itself it's O(n²) ) due to changing the elements in the memory. Because the deletion changes the memory on the locations after the deleted element you also cannot benefit from prefetching which also is a thing which makes the vector that fast.
I am not sure how much the deletion also would invalidate the caches because the memory beyond the iterator has changed but this can also have a very big impact on the performance.
In the first test, the list had to traverse to the point of deletion, then delete the entry. The time the list took was in traversing for each deletion.
In the second test, the list traversed once, then repeatedly deleted. The time taken was still in the traversal; the deletion was cheap. Except now we don't repeatedly traverse.
For the vector, traversal is free. Deletion takes time. Randomly deleting an element takes less time than it took for the list to traverse to that random element, so vector wins in the first case.
In the second case, the vector does the hard work many many more times than the list does it hard work.
But, the problem is that isn't how you should traverse-and-delete from a vector. It is an acceptable way to do it for a list.
The way you'd write this for a vector is std::remove_if, followed by erase. Or just one erase:
auto index = vec.size() / 2;
auto itr = vec.begin() + index;
vec.erase(itr, itr+10000);
Or, to emulate a more complex decision making process involving erasing elements:
auto index = vec.size() / 2;
auto itr = vec.begin() + index;
int count = 10000;
auto last = std::remove_if( itr, vec.end(),
[&count](auto&&){
if (count <= 0) return false;
--count;
return true;
}
);
vec.erase(last, vec.end());
Almost the only case where list is way faster than vector is when you store an iterator into the list, and you periodically erase at or near that iterator while still traversing the list between such erase actions.
Almost every other use case has a vector use-pattern that matches or exceeds list performance in my experience.
The code cannot always be translated line-for-line, as you have demonstrated.
Every time you erase an element in a vector, it moves the "tail" of the vector over 1.
If you erase 10,000 elements, it moves the "tail" of the vector over 10000 in one step.
If you remove_if, it removes the tail over efficiently, gives you the "wasted" remaining, and you can then remove the waste from the vector.
I want po point out something still not mentioned in this question:
In the std::vector, when you delete an element in the middle, the elements are moved thanks to new move semantics. That is one of the reasons the first test takes this speed, because you are not even copying the elements after the deleted iterator. You could reproduce the experiment with a vector and list of non-copiable type and see how the performance of the list (in comparation) is better.
I would suggest to run the same tests using a more complex data type in the std::vector: instead of an int, use a structure.
Even better use a static C array as a vector element, and then take measurements with different array sizes.
So, you could swap this line of your code:
std::vector<int> vec;
with something like:
const size_t size = 256;
struct TestType { int a[size]; };
std::vector<TestType> vec;
and test with different values of size. The behavior may depend on this parameter.

C++11 vector<bool> performance issue (with code example)

I notice that vector is much slower than bool array when running the following code.
int main()
{
int count = 0;
int n = 1500000;
// slower with c++ vector<bool>
/*vector<bool> isPrime;
isPrime.reserve(n);
isPrime.assign(n, true);
*/
// faster with bool array
bool* isPrime = new bool[n];
for (int i = 0; i < n; ++i)
isPrime[i] = true;
for (int i = 2; i< n; ++i) {
if (isPrime[i])
count++;
for (int j =2; i*j < n; ++j )
isPrime[i*j] = false;
}
cout << count << endl;
return 0;
}
Is there some way that I can do to make vector<bool> faster ? Btw, both std::vector::push_back and std::vector::emplace_back are even slower than std::vector::assign.
std::vector<bool> can have various performance issues (e.g. take a look at https://isocpp.org/blog/2012/11/on-vectorbool).
In general you can:
use std::vector<std::uint8_t> instead of std::vector<bool> (give a try to std::valarray<bool> also).
This requires more memory and is less cache-friendly but there isn't a overhead (in the form of bit manipulation) to access a single value, so there are situations in which it works better (after all it's just like your array of bool but without the nuisance of memory management)
use std::bitset if you know at compile time how large your boolean array is going to be (or if you can at least establish a reasonable upper bound)
if Boost is an option try boost::dynamic_bitset (the size can be specified at runtime)
But for speed optimizations you have to test...
With your specific example I can confirm a performance difference only when optimizations are turned off (of course this isn't the way to go).
Some tests with g++ v4.8.3 and clang++ v3.4.5 on an Intel Xeon system (-O3 optimization level) give a different picture:
time (ms)
G++ CLANG++
array of bool 3103 3010
vector<bool> 2835 2420 // not bad!
vector<char> 3136 3031 // same as array of bool
bitset 2742 2388 // marginally better
(time elapsed for 100 runs of the code in the answer)
std::vector<bool> doesn't look so bad (source code here).
vector<bool> may have a template specialization and may be implemented using bit array to save space. Extracting and saving a bit and converting it from / to bool may cause the performance drop you are observing. If you use std::vector::push_back, you are resizing the vector which will cause even worse performance. Next performance killer may be assign (Worst complexity: Linear of first argument), instead use operator [] (Complexity: constant).
On the other hand, bool [] is guaranteed to be array of bool.
And you should resize to n instead of n-1 to avoid undefined behaviour.
vector<bool> can be high performance, but isn't required to be. For vector<bool> to be efficient, it needs to operate on many bools at a time (e.g. isPrime.assign(n, true)), and the implementor has had to put loving care into it. Indexing individual bools in a vector<bool> is slow.
Here is a prime finder that I wrote a while back using vector<bool> and clang + libc++ (the libc++ part is important):
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
std::vector<bool>
init_primes()
{
std::vector<bool> primes(0x80000000, true);
primes[0] = false;
primes[1] = false;
const auto pb = primes.begin();
const auto pe = primes.end();
const auto sz = primes.size();
size_t i = 2;
while (true)
{
size_t j = i*i;
if (j >= sz)
break;
do
{
primes[j] = false;
j += i;
} while (j < sz);
i = std::find(pb + (i+1), pe, true) - pb;
}
return primes;
}
int
main()
{
using namespace std::chrono;
using dsec = duration<double>;
auto t0 = steady_clock::now();
auto p = init_primes();
auto t1 = steady_clock::now();
std::cout << dsec(t1-t0).count() << "\n";
}
This executes for me in about 28s (-O3). When I change it to return a vector<char> instead, the execution time goes up to about 44s.
If you run this using some other std::lib, you probably won't see this trend. On libc++ algorithms such as std::find have been optimized to search a word of bits at a time, instead of bit at a time.
See http://howardhinnant.github.io/onvectorbool.html for more details on what std algorithms could be optimized by your vendor.

call to condition on for loop (c++)

Here is a simple question I have been wondering about for a long time :
When I do a loop such as this one :
for (int i = 0; i < myVector.size() ; ++i) {
// my loop
}
As the condition i < myVector.size() is checked each time, should I store the size of the array inside a variable before the loop to prevent the call to size() each iteration ? Or is the compiler smart enough to do it itself ?
mySize = myVector.size();
for (int i = 0; i < mySize ; ++i) {
// my loop
}
And I would extend the question with a more complex condition such as i < myVector.front()/myVector.size()
Edit : I don't use myVector inside the loop, it is juste here to give the ending condition. And what about the more complex condition ?
The answer depends mainly on the contents of your loop–it may modify the vector during processing, thus modifying its size.
However if the vector is just scanned you can safely store its size in advance:
for (int i = 0, mySize = myVector.size(); i < mySize ; ++i) {
// my loop
}
although in most classes the functions like 'get current size' are just inline getters:
class XXX
{
public:
int size() const { return mSize; }
....
private:
int mSize;
....
};
so the compiler can easily reduce the call to just reading the int variable, consequently prefetching the length gives no gain.
If you are not changing anything in vector (adding/removing) during for-loop (which is normal case) I would use foreach loop
for (auto object : myVector)
{
//here some code
}
or if you cannot use c++11 I would use iterators
for (auto it = myVector.begin(); it != myVector.end(); ++it)
{
//here some code
}
I'd say that
for (int i = 0; i < myVector.size() ; ++i) {
// my loop
}
is a bit safer than
mySize = myVector.size();
for (int i = 0; i < mySize ; ++i) {
// my loop
}
because the value of myVector.size() may change (as result of , e.g. push_back(value) inside the loop) thus you might miss some of the elements.
If you are 100% sure that the value of myVector.size() is not going to change, then both are the same thing.
Yet, the first one is a bit more flexible than the second (other developer may be unaware that the loop iterates over fixed size and he might change the array size). Don't worry about the compiler, he's smarter than both of us combined.
The overhead is very small.
vector.size() does not recalculate anything, but simply returns the value of the private size variable..
it is safer than pre-buffering the value, as the vectors internal size variable is changed when an element is popped or pushed to/from the vector..
compilers can be written to optimize this out, if and only if, it can predict that the vector is not changed by ANYTHING while the for loop runs.
That is difficult to do if there are threads in there.
but if there isn't any threading going on, it's very easy to optimize it.
Any smart compiler will probably optimize this out. However just to be sure I usually lay out my for loops like this:
for (int i = myvector.size() -1; i >= 0; --i)
{
}
A couple of things are different:
The iteration is done the other way around. Although this shouldn't be a problem in most cases. If it is I prefer David Haim's method.
The --i is used rather than a i--. In theory the --i is faster, although on most compilers it won't make a difference.
If you don't care about the index this:
for (int i = myvector.size(); i > 0; --i)
{
}
Would also be an option. Altough in general I don't use it because it is a bit more confusing than the first. And will not gain you any performance.
For a type like a std::vector or std::list an iterator is the preffered method:
for (std::vector</*vectortype here*/>::iterator i = myVector.begin(); i != myVector.end(); ++i)
{
}

efficient way to copy array with mask in c++

I have two arrays. One is "x" factor the size of the second one.
I need to copy from the first (bigger) array to the second (smaller) array only its x element.
Meaning 0,x,2x.
Each array sits as a block in the memory.
The array is of simple values.
I am currently doing it using a loop.
Is there any faster smarter way to do this?
Maybe with ostream?
Thanks!
You are doing something like this right?
#include <cstddef>
int main()
{
const std::size_t N = 20;
const std::size_t x = 5;
int input[N*x];
int output[N];
for(std::size_t i = 0; i < N; ++i)
output[i] = input[i*x];
}
well, I don't know any function that can do that, so I would use the for loop. This is fast.
EDIT: even faster solution (to avoid multiplications)(C++03 Version)
int* inputit = input;
int* outputit = output;
int* outputend = output+N;
while(outputit != outputend)
{
*outputit = *inputit;
++outputit;
inputit+=x;
}
if I get you right you want to copy every n-th element. the simplest solution would be
#include <iostream>
int main(int argc, char **argv) {
const int size[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int out[5];
int *pout = out;
for (const int *i = &size[0]; i < &size[10]; i += 3) {
std::cout << *i << ", ";
*pout++ = *i;
if (pout > &out[4]) {
break;
}
}
std::cout << "\n";
for (const int *i = out; i < pout; i++) {
std::cout << *i << ", ";
}
std::cout << std::endl;
}
You can use copy_if and lambda in C++11:
copy_if(a.begin(), a.end(), b.end(), [&] (const int& i) -> bool
{ size_t index = &i - &a[0]; return index % x == 0; });
A test case would be:
#include <iostream>
#include <vector>
#include <algorithm> // std::copy_if
using namespace std;
int main()
{
std::vector<int> a;
a.push_back(0);
a.push_back(1);
a.push_back(2);
a.push_back(3);
a.push_back(4);
std::vector<int> b(3);
int x = 2;
std::copy_if(a.begin(), a.end(), b.begin(), [&] (const int& i) -> bool
{ size_t index = &i - &a[0]; return index % x == 0; });
for(int i=0; i<b.size(); i++)
{
std::cout<<" "<<b[i];
}
return 0;
}
Note that you need to use a C++11 compatible compiler (if gcc, with -std=c++11 option).
template<typename InIt, typename OutIt>
void copy_step_x(InIt first, InIt last, OutIt result, int x)
{
for(auto it = first; it != last; std::advance(it, x))
*result++ = *it;
}
int main()
{
std::array<int, 64> ar0;
std::array<int, 32> ar1;
copy_step_x(std::begin(ar0), std::end(ar0), std::begin(ar1), ar0.size() / ar1.size());
}
The proper and clean way of doing this is a loop like has been said before. A number of good answers here show you how to do that.
I do NOT recommend doing it in the following fashion, it depends on a lot of specific things, value range of X, size and value range of the variables and so on but for some you could do it like this:
for every 4 bytes:
tmp = copy a 32 bit variable from the array, this now contains the 4 new values
real_tmp = bitmask tmp to get the right variable of those 4
add it to the list
This only works if you want values <= 255 and X==4, but if you want something faster than a loop this is one way of doing it. This could be modified for 16bit, 32bit or 64bit values and every 2,3,4,5,6,7,8(64 bit) values but for X>8 this method will not work, or for values that are not allocated in a linear fashion. It won't work for classes either.
For this kind of optimization to be worth the hassle the code need to run often, I assume you've run a profiler to confirm that the old copy is a bottleneck before starting implementing something like this.
The following is an observation on how most CPU designs are unimaginative when it comes to this sort of thing.
On some OpenVPX you have the ability to DMA data from one processor to another. The one that I use has a pretty advanced DMA controller, and it can do this sort of thing for you.
For example, I could ask it to copy your big array to another CPU, but skipping over N elements of the array, just like you're trying to do. As if by magic the destination CPU would have the smaller array in its memory. I could also if I wanted perform matrix transformations, etc.
The nice thing is that it takes no CPU time at all to do this; it's all done by the DMA engine. My CPUs can then concentrate on harder sums instead of being tied down shuffling data around.
I think the Cell processor in the PS3 can do this sort of thing internally (I know it can DMA data around, I don't know if it will do the strip mining at the same time). Some DSP chips can do it too. But x86 doesn't do it, meaning us software programmers have to write ridiculous loops just moving data in simple patterns. Yawn.
I have written a multithreaded memcpy() in the past to do this sort of thing. The only way you're going to beat a for loop is to have several threads doing your for loop in several parallel chunks.
If you pick the right compiler (eg Intel's ICC or Sun/Oracles Sun Studio) they can be made to automatically parallelise your for loops on your behalf (so your source code doesn't change). That's probably the simplest way to beat your original for loop.

Dynamically allocated arrays or std::vector

I'm trying to optimize my C++ code. I've searched the internet on using dynamically allocated C++ arrays vs using std::vector and have generally seen a recommendation in favor of std::vector and that the difference in performance between the two is negligible. For instance here - Using arrays or std::vectors in C++, what's the performance gap?.
However, I wrote some code to test the performance of iterating through an array/vector and assigning values to the elements and I generally found that using dynamically allocated arrays was nearly 3 times faster than using vectors (I did specify a size for the vectors beforehand). I used g++-4.3.2.
However I feel that my test may have ignored issues I don't know about so I would appreciate any advice on this issue.
Thanks
Code used -
#include <time.h>
#include <iostream>
#include <vector>
using namespace std;
int main() {
clock_t start,end;
std::vector<int> vec(9999999);
std::vector<int>::iterator vecIt = vec.begin();
std::vector<int>::iterator vecEnd = vec.end();
start = clock();
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = i;
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i;
}
end = clock();
cout<<"array: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
}
When benchmarking C++ comtainers, it's important to enable most compiler optimisations. Several of my own answers on SO have fallen foul of this - for example, the function call overhead when something like operator[] is not inlined can be very significant.
Just for fun, try iterating over the plain array using a pointer instead of an integer index (the code should look just like the vector iteration, since the point of STL iterators is to appear like pointer arithmetic for most operations). I bet the speed will be exactly equal in that case. Which of course means you should pick the vector, since it will save you a world of headaches from managing arrays by hand.
The thing about the standard library classes such as std::vector is that yes, naively, it is a lot more code than a raw array. But all of it can be trivially inlined by the compiler, which means that if optimizations are enabled, it becomes essentially the same code as if you'd used a raw array. The speed difference then is not negligible but non-existent. All the overhead is removed at compile-time.
But that requires compiler optimizations to be enabled.
I imagine the reason why you found iterating and adding to std::vector 3 times slower than a plain array is a combination of the cost of iterating the vector and doing the assigment.
Edit:
That was my initial assumption before the testcase; however running the testcase (compiled with -O3) shows the converse - std::vector is actually 3 times faster, which surprised me.
I can't see how std::vector could be faster (certainly not 3 times faster) than a vanilla array copy - I think there's some optimisation being applied to the std::vector compiled code which isn't happening for the array version.
Original benchmark results:
$ ./array
array: 0.059375
vector: 0.021209
std::vector is 3x faster. Same benchmark again, except add an additional outer loop to run the test iterater loop 1000 times:
$ ./array
array: 21.7129
vector: 21.6413
std::vector is now ~ the same speed as array.
Edit 2
Found it! So the problem with your test case is that in the vector case the memory holding the data appears to be already in the CPU cache - either by the way it is initialised, or due to the call to vec.end(). If I 'warm' up the CPU cache before each timing test, I get the same numbers for array and vector:
#include <time.h>
#include <iostream>
#include <vector>
int main() {
clock_t start,end;
std::vector<int> vec(9999999);
std::vector<int>::iterator vecIt = vec.begin();
std::vector<int>::iterator vecEnd = vec.end();
// get vec into CPU cache.
for (int i = 0; vecIt != vecEnd; i++) { *(vecIt++) = i; }
vecIt = vec.begin();
start = clock();
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = i;
}
end = clock();
std::cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
int* arr = new int[9999999];
// get arr into CPU cache.
for (int i = 0; i < 9999999; i++) { arr[i] = i; }
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i;
}
end = clock();
std::cout<<"array: "<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
}
This gives me the following result:
$ ./array
vector: 0.020875
array: 0.020695
I agree with rmeador,
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = i; // <-- quick offset calculation
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i; // <-- not fair play :) - offset = arr + i*size(int)
}
I think the answer here is obvious: it doesn't matter. Like jalf said the code will end up being about the same, but even if it wasn't, look at the numbers. The code you posted creates a huge array of 10 MILLION items, yet iterating over the entire array takes only a few hundredths of a second.
Even if your application really is working with that much data, whatever it is you're actually doing with that data is likely to take much more time than iterating over your array. Just use whichever data structure you prefer, and focus your time on the rest of your code.
To prove my point, here's the code with one change: the assignment of i to the array item is replaced with an assignment of sqrt(i). On my machine using -O2, the execution time triples from .02 to .06 seconds.
#include <time.h>
#include <iostream>
#include <vector>
#include <math.h>
using namespace std;
int main() {
clock_t start,end;
std::vector<int> vec(9999999);
std::vector<int>::iterator vecIt = vec.begin();
std::vector<int>::iterator vecEnd = vec.end();
start = clock();
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = sqrt(i);
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i;
}
end = clock();
cout<<"array: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
}
The issue seems to be that you compiled your code with optimizations turned off. On my machine, OS X 10.5.7 with g++ 4.0.1 I actually see that the vector is faster than primitive arrays by a factor of 2.5.
With gcc try to pass -O2 to the compiler and see if there's any improvement.
The reason that your array iterating is faster is that the the number of iteration is constant, and compiler is able to unroll the loop. Try to use rand to generate a number, and multiple it to be a big number you wanted so that compiler wont be able to figure it out at compile time. Then try it again, you will see similar runtime results.
One reason you're code might not be performing quite the same is because on your std::vector version, you are incrimenting two values, the integer i and the std::vector::iterator vecIt. To really be equivalent, you could refactor to
start = clock();
for (int i = 0; i < vec.size(); i++) {
vec[i] = i;
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
Your code provides an unfair comparison between the two cases since you're doing far more work in the vector test than the array test.
With the vector, you're incrementing both the iterator (vecIT) and a separate variable (i) for generating the assignment values.
With the array, you're only incrementing the variable i and using it for dual purpose.