I'm trying to optimize my C++ code. I've searched the internet on using dynamically allocated C++ arrays vs. using std::vector and have generally seen recommendations in favor of std::vector, along with claims that the performance difference between the two is negligible. For instance, here - Using arrays or std::vectors in C++, what's the performance gap?.
However, I wrote some code to test the performance of iterating through an array/vector and assigning values to the elements, and I generally found that using dynamically allocated arrays was nearly 3 times faster than using vectors (I did specify a size for the vectors beforehand). I used g++-4.3.2.
However, I feel that my test may have ignored issues I don't know about, so I would appreciate any advice on this issue.
Thanks
Code used -
#include <time.h>
#include <iostream>
#include <vector>
using namespace std;
int main() {
    clock_t start, end;

    std::vector<int> vec(9999999);
    std::vector<int>::iterator vecIt = vec.begin();
    std::vector<int>::iterator vecEnd = vec.end();

    start = clock();
    for (int i = 0; vecIt != vecEnd; i++) {
        *(vecIt++) = i;
    }
    end = clock();
    cout << "vector: " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    int* arr = new int[9999999];
    start = clock();
    for (int i = 0; i < 9999999; i++) {
        arr[i] = i;
    }
    end = clock();
    cout << "array: " << (double)(end - start) / CLOCKS_PER_SEC << endl;
}
When benchmarking C++ containers, it's important to enable most compiler optimisations. Several of my own answers on SO have fallen foul of this - for example, the function call overhead when something like operator[] is not inlined can be very significant.
Just for fun, try iterating over the plain array using a pointer instead of an integer index (the code should look just like the vector iteration, since the point of STL iterators is to appear like pointer arithmetic for most operations). I bet the speed will be exactly equal in that case. Which of course means you should pick the vector, since it will save you a world of headaches from managing arrays by hand.
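For reference, a minimal sketch of what that pointer-based loop might look like (it mirrors the iterator loop from the question; the 9999999 size is taken from there):

int main() {
    const int n = 9999999;
    int* arr = new int[n];
    int* it = arr;
    int* const end = arr + n;
    // same shape as the vector's iterator loop: advance a pointer, assign
    for (int i = 0; it != end; i++) {
        *(it++) = i;
    }
    delete[] arr;
}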
The thing about the standard library classes such as std::vector is that yes, naively, it is a lot more code than a raw array. But all of it can be trivially inlined by the compiler, which means that if optimizations are enabled, it becomes essentially the same code as if you'd used a raw array. The speed difference then is not negligible but non-existent. All the overhead is removed at compile-time.
But that requires compiler optimizations to be enabled.
I imagine the reason why you found iterating and adding to std::vector 3 times slower than a plain array is a combination of the cost of iterating the vector and doing the assignment.
Edit:
That was my initial assumption before the testcase; however running the testcase (compiled with -O3) shows the converse - std::vector is actually 3 times faster, which surprised me.
I can't see how std::vector could be faster (certainly not 3 times faster) than a vanilla array copy - I think there's some optimisation being applied to the std::vector compiled code which isn't happening for the array version.
Original benchmark results:
$ ./array
array: 0.059375
vector: 0.021209
std::vector is 3x faster. Same benchmark again, except add an additional outer loop to run the test iteration loop 1000 times:
$ ./array
array: 21.7129
vector: 21.6413
std::vector is now ~ the same speed as array.
Edit 2
Found it! So the problem with your test case is that in the vector case the memory holding the data appears to be already in the CPU cache - either by the way it is initialised, or due to the call to vec.end(). If I 'warm' up the CPU cache before each timing test, I get the same numbers for array and vector:
#include <time.h>
#include <iostream>
#include <vector>
int main() {
    clock_t start, end;

    std::vector<int> vec(9999999);
    std::vector<int>::iterator vecIt = vec.begin();
    std::vector<int>::iterator vecEnd = vec.end();

    // get vec into CPU cache.
    for (int i = 0; vecIt != vecEnd; i++) { *(vecIt++) = i; }
    vecIt = vec.begin();

    start = clock();
    for (int i = 0; vecIt != vecEnd; i++) {
        *(vecIt++) = i;
    }
    end = clock();
    std::cout << "vector: " << (double)(end - start) / CLOCKS_PER_SEC << std::endl;

    int* arr = new int[9999999];
    // get arr into CPU cache.
    for (int i = 0; i < 9999999; i++) { arr[i] = i; }

    start = clock();
    for (int i = 0; i < 9999999; i++) {
        arr[i] = i;
    }
    end = clock();
    std::cout << "array: " << (double)(end - start) / CLOCKS_PER_SEC << std::endl;
}
This gives me the following result:
$ ./array
vector: 0.020875
array: 0.020695
I agree with rmeador,
for (int i = 0; vecIt != vecEnd; i++) {
    *(vecIt++) = i; // <-- quick offset calculation
}
end = clock();
cout << "vector: " << (double)(end - start) / CLOCKS_PER_SEC << endl;

int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
    arr[i] = i; // <-- not fair play :) - offset = arr + i*size(int)
}
I think the answer here is obvious: it doesn't matter. Like jalf said, the code will end up being about the same, but even if it wasn't, look at the numbers. The code you posted creates a huge array of 10 MILLION items, yet iterating over the entire array takes only a few hundredths of a second.
Even if your application really is working with that much data, whatever it is you're actually doing with that data is likely to take much more time than iterating over your array. Just use whichever data structure you prefer, and focus your time on the rest of your code.
To prove my point, here's the code with one change: the assignment of i to the array item is replaced with an assignment of sqrt(i). On my machine using -O2, the execution time triples from .02 to .06 seconds.
#include <time.h>
#include <iostream>
#include <vector>
#include <math.h>
using namespace std;
int main() {
    clock_t start, end;

    std::vector<int> vec(9999999);
    std::vector<int>::iterator vecIt = vec.begin();
    std::vector<int>::iterator vecEnd = vec.end();

    start = clock();
    for (int i = 0; vecIt != vecEnd; i++) {
        *(vecIt++) = sqrt(i);
    }
    end = clock();
    cout << "vector: " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    int* arr = new int[9999999];
    start = clock();
    for (int i = 0; i < 9999999; i++) {
        arr[i] = i;
    }
    end = clock();
    cout << "array: " << (double)(end - start) / CLOCKS_PER_SEC << endl;
}
The issue seems to be that you compiled your code with optimizations turned off. On my machine (OS X 10.5.7 with g++ 4.0.1) I actually see that the vector is faster than primitive arrays by a factor of 2.5.
With gcc try to pass -O2 to the compiler and see if there's any improvement.
The reason that your array iterating is faster is that the number of iterations is constant, and the compiler is able to unroll the loop. Try using rand to generate a number and multiply it up to the big number you want, so that the compiler won't be able to figure the bound out at compile time. Then try it again; you will see similar runtime results.
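A minimal sketch of that change (the arithmetic with rand is only illustrative; any runtime-dependent bound will do):

#include <cstdlib>
#include <ctime>

int main() {
    std::srand(static_cast<unsigned>(std::time(0)));
    // n is no longer a compile-time constant, so the compiler cannot
    // fully unroll or precompute the loop
    const int n = 9999999 + (std::rand() % 2);
    int* arr = new int[n];
    for (int i = 0; i < n; i++)
        arr[i] = i;
    delete[] arr;
}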
One reason your code might not be performing quite the same is that in your std::vector version you are incrementing two values: the integer i and the std::vector::iterator vecIt. To really be equivalent, you could refactor to:
start = clock();
for (int i = 0; i < vec.size(); i++) {
    vec[i] = i;
}
end = clock();
cout << "vector: " << (double)(end - start) / CLOCKS_PER_SEC << endl;
Your code provides an unfair comparison between the two cases since you're doing far more work in the vector test than the array test.
With the vector, you're incrementing both the iterator (vecIT) and a separate variable (i) for generating the assignment values.
With the array, you're only incrementing the variable i and using it for dual purpose.
Related
What is the fastest way to access random (non-sequential) elements in an array if the access pattern is known beforehand? The access is random for different needs at every step, so rearranging the elements is an expensive option. The code below represents an important sample of the whole application.
#include <iostream>
#include <chrono>
#include <cstdlib>

#define NN 1000000

struct Astr {
    double x[3], v[3];
    int i, j, k;
    long rank, p, q, r;
};

int main()
{
    struct Astr *key;
    key = new Astr[NN];
    int ii, *sequence;
    sequence = new int[NN]; // access pattern is stored here
    float frac;

    // create array of structs
    // create array for random numbers between 0 to NN to access 'key'
    for (int i = 0; i < NN; i++) {
        key[i].x[1] = static_cast<double>(i);
        key[i].p = static_cast<long>(i);
        frac = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        sequence[i] = static_cast<int>(frac * static_cast<float>(NN));
    }

    // part to check and improve
    // =========================================Random=======================================================
    std::chrono::high_resolution_clock::time_point TstartMain = std::chrono::high_resolution_clock::now();
    double tmp;
    long rnk;
    for (int j = 0; j < 1000; j++)
        for (int i = 0; i < NN; i++) {
            ii = sequence[i];
            tmp = key[ii].x[1];
            rnk = key[ii].p;
            key[ii].x[1] = tmp * 1.01;
            key[ii].p = rnk * 1.01;
        }
    std::chrono::high_resolution_clock::time_point TendMain = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
    double time_uni = static_cast<double>(duration.count()) / 1000000;
    std::cout << "\n Random array access " << time_uni << "s \n" ;

    // ==========================================Sequential======================================================
    TstartMain = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < 1000; j++)
        for (int i = 0; i < NN; i++) {
            tmp = key[i].x[1];
            rnk = key[i].p;
            key[i].x[1] = tmp * 1.01;
            key[i].p = rnk * 1.01;
        }
    TendMain = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>( TendMain - TstartMain );
    time_uni = static_cast<double>(duration.count()) / 1000000;
    std::cout << " Sequential array access " << time_uni << "s \n" ;
    // ================================================================================================

    delete [] key;
    delete [] sequence;
}
As expected, sequential access is faster; the results on my machine are as follows:
Random array access 21.3763s
Sequential array access 8.7755s
The main question is whether random access could be made any faster.
The code improvement could be in terms of the container itself ( e.g. list/vector rather than array). Could software prefetching be implemented?
In theory it is possible to help guide the prefetcher to speed up random access (well, on those CPUs that support it - e.g. via _mm_prefetch for Intel/AMD). In practice, however, this is often a complete waste of time and will, more often than not, slow down your code.
The general theory is that you pass a pointer to the _mm_prefetch intrinsic a loop iteration or two prior to using the value. There are however problems with this:
It is likely that you'll end up tuning the code for your CPU. When running that same code on other platforms, you'll probably find that different CPU cache layouts/sizes mean that your prefetch optimisations are now actually slowing the performance down.
The additional prefetch instructions will end up using up more of your instruction cache, and most likely your uop cache as well. You may find this alone slows the code down.
This assumes the CPU actually pays attention to the _mm_prefetch instruction. It is only a hint, so there are no guarantees it will be respected by the CPU.
If you want to speed up random memory access, there are better methods than prefetching imho.
Reduce the size of the data (i.e. use shorts/float16s in place of int/float, eradicate any erroneous padding in your structs, etc.). By reducing the size of the structs, you have less memory to read, so it will go quicker! (Simple compression schemes aren't a bad idea either!) See the sketch after this list.
Sort your data so that instead of doing random access, you are processing the data sequentially.
Other than those two options, the best bet is to leave prefetching well alone and let the compiler do its thing with your random access (the only exception: you are optimising code for a ~2001 Pentium 4, where prefetching was basically required).
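A minimal sketch of the first option, assuming float precision and 32-bit integer ranges are acceptable for this data (that is an assumption about the application, not a given):

#include <cstdint>

// The question's struct: roughly 80-96 bytes per element, depending on platform.
struct Astr {
    double x[3], v[3];
    int i, j, k;
    long rank, p, q, r;
};

// Hypothetical slimmed variant: roughly half the per-element footprint,
// so each random access drags fewer bytes through the cache.
struct AstrSmall {
    float x[3], v[3];
    std::int32_t i, j, k;
    std::int32_t rank, p, q, r;
};

static_assert(sizeof(AstrSmall) < sizeof(Astr), "smaller per-element footprint");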
To give an example of what #robthebloke says, the following code makes a ~15% improvement on my machine:
#include <immintrin.h>

#define NN 1000000 // NN and struct Astr as defined in the question's code

void do_it(struct Astr *key, const int *sequence) {
    for (int i = 0; i < NN - 8; ++i) {
        // prefetch the element needed 8 iterations from now; _mm_prefetch
        // takes char const* on some compilers, hence the cast
        _mm_prefetch(reinterpret_cast<const char *>(key + sequence[i + 8]), _MM_HINT_NTA);
        struct Astr *ki = key + sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
    // tail: the last 8 iterations have nothing left to prefetch
    for (int i = NN - 8; i < NN; ++i) {
        struct Astr *ki = key + sequence[i];
        ki->x[1] *= 1.01;
        ki->p *= 1.01;
    }
}
I notice that vector<bool> is much slower than a bool array when running the following code.
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    int count = 0;
    int n = 1500000;

    // slower with C++ vector<bool>
    /*vector<bool> isPrime;
    isPrime.reserve(n);
    isPrime.assign(n, true);
    */

    // faster with bool array
    bool* isPrime = new bool[n];
    for (int i = 0; i < n; ++i)
        isPrime[i] = true;

    for (int i = 2; i < n; ++i) {
        if (isPrime[i])
            count++;
        for (int j = 2; i * j < n; ++j)
            isPrime[i * j] = false;
    }

    cout << count << endl;
    return 0;
}
Is there something I can do to make vector<bool> faster? Btw, both std::vector::push_back and std::vector::emplace_back are even slower than std::vector::assign.
std::vector<bool> can have various performance issues (e.g. take a look at https://isocpp.org/blog/2012/11/on-vectorbool).
In general you can:
use std::vector<std::uint8_t> instead of std::vector<bool> (and give a try to std::valarray<bool> also).
This requires more memory and is less cache-friendly, but there isn't an overhead (in the form of bit manipulation) to access a single value, so there are situations in which it works better (after all it's just like your array of bool, but without the nuisance of memory management). See the sketch after this list.
use std::bitset if you know at compile time how large your boolean array is going to be (or if you can at least establish a reasonable upper bound)
if Boost is an option try boost::dynamic_bitset (the size can be specified at runtime)
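A minimal sketch of the first two alternatives (the 1500000 size is taken from the question; this is illustrative, not a benchmark):

#include <bitset>
#include <cstdint>
#include <vector>

int main() {
    const int n = 1500000;
    // one full byte per flag: more memory, but plain loads/stores
    std::vector<std::uint8_t> flags(n, 1);
    // compile-time size; static storage keeps the large object off the stack
    static std::bitset<1500000> fixed;
    fixed.set(); // all true
    return (flags[0] && fixed[0]) ? 0 : 1;
}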
But for speed optimizations you have to test...
With your specific example I can confirm a performance difference only when optimizations are turned off (of course this isn't the way to go).
Some tests with g++ v4.8.3 and clang++ v3.4.5 on an Intel Xeon system (-O3 optimization level) give a different picture:
                 time (ms)
               G++     CLANG++
array of bool  3103    3010
vector<bool>   2835    2420    // not bad!
vector<char>   3136    3031    // same as array of bool
bitset         2742    2388    // marginally better
(time elapsed for 100 runs of the code in the answer)
std::vector<bool> doesn't look so bad (source code here).
vector<bool> may have a template specialization and may be implemented using a bit array to save space. Extracting and saving a bit and converting it from/to bool may cause the performance drop you are observing. If you use std::vector::push_back, you are resizing the vector, which will cause even worse performance. Another performance killer may be assign (worst-case complexity: linear in the first argument); instead use operator[] (complexity: constant).
On the other hand, bool [] is guaranteed to be array of bool.
And you should resize to n instead of n-1 to avoid undefined behaviour.
vector<bool> can be high performance, but isn't required to be. For vector<bool> to be efficient, it needs to operate on many bools at a time (e.g. isPrime.assign(n, true)), and the implementor has had to put loving care into it. Indexing individual bools in a vector<bool> is slow.
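For instance, a minimal sketch of why individual indexing is slow: operator[] on vector<bool> returns a proxy object rather than a bool&, so every access goes through bit manipulation:

#include <vector>

int main() {
    std::vector<bool> v(8);
    auto ref = v[3]; // std::vector<bool>::reference - a proxy, not bool&
    ref = true;      // the write happens via mask-and-shift bit manipulation
    bool b = v[3];   // the read extracts a single bit from a packed word
    return b ? 0 : 1;
}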
Here is a prime finder that I wrote a while back using vector<bool> and clang + libc++ (the libc++ part is important):
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
std::vector<bool>
init_primes()
{
    std::vector<bool> primes(0x80000000, true);
    primes[0] = false;
    primes[1] = false;

    const auto pb = primes.begin();
    const auto pe = primes.end();
    const auto sz = primes.size();

    size_t i = 2;
    while (true)
    {
        size_t j = i*i;
        if (j >= sz)
            break;
        do
        {
            primes[j] = false;
            j += i;
        } while (j < sz);
        i = std::find(pb + (i+1), pe, true) - pb;
    }
    return primes;
}

int
main()
{
    using namespace std::chrono;
    using dsec = duration<double>;
    auto t0 = steady_clock::now();
    auto p = init_primes();
    auto t1 = steady_clock::now();
    std::cout << dsec(t1-t0).count() << "\n";
}
This executes for me in about 28s (-O3). When I change it to return a vector<char> instead, the execution time goes up to about 44s.
If you run this using some other std::lib, you probably won't see this trend. On libc++, algorithms such as std::find have been optimized to search a word of bits at a time, instead of a bit at a time.
See http://howardhinnant.github.io/onvectorbool.html for more details on what std algorithms could be optimized by your vendor.
I read this nice experiment comparing, in particular, the performance of calling insert() on both a vector and a deque container. The result from that particular experiment (Experiment 4) was that deque is vastly superior for this operation.
I implemented my own test using a short sorting function I wrote, which I should note uses the [] operator along with other member functions, and found vastly different results. For example, for inserting 100,000 elements, vector took 24.88 seconds, while deque took 374.35 seconds.
How can I explain this? I imagine it has something to do with my sorting function, but would like the details!
I'm using g++ 4.6 with no optimizations.
Here's the program:
#include <iostream>
#include <vector>
#include <deque>
#include <cstdlib>
#include <ctime>
using namespace std;
size_t InsertionIndex(vector<double>& vec, double toInsert) {
    for (size_t i = 0; i < vec.size(); ++i)
        if (toInsert < vec[i])
            return i;
    return vec.size(); // return last index+1 if toInsert is largest yet
}

size_t InsertionIndex(deque<double>& deq, double toInsert) {
    for (size_t i = 0; i < deq.size(); ++i)
        if (toInsert < deq[i])
            return i;
    return deq.size(); // return last index+1 if toInsert is largest yet
}

int main() {
    vector<double> vec;
    deque<double> deq;
    size_t N = 100000;

    clock_t tic = clock();
    for (int i = 0; i < N; ++i) {
        double val = rand();
        vec.insert(vec.begin() + InsertionIndex(vec, val), val);
        // deq.insert(deq.begin() + InsertionIndex(deq, val), val);
    }
    float total = (float)(clock() - tic) / CLOCKS_PER_SEC;
    cout << total << endl;
}
The special case where deque can be much faster than vector is when you're inserting at the front of the container. In your case you're inserting at random locations, which will actually give the advantage to vector.
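A minimal sketch of that special case (the element count is illustrative; note the vector version is intentionally quadratic here):

#include <deque>
#include <vector>

int main() {
    const int N = 100000;
    std::deque<double> deq;
    for (int i = 0; i < N; ++i)
        deq.push_front(i);          // amortized O(1) at the front
    std::vector<double> vec;
    for (int i = 0; i < N; ++i)
        vec.insert(vec.begin(), i); // O(size) each time: shifts every element
    return (deq.size() == vec.size()) ? 0 : 1;
}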
Also unless you're using an optimized build, it's quite possible that there are bounds checks in the library implementation. Those checks can add significantly to the time. To do a proper benchmark comparison you must run with all normal optimizations turned on and debug turned off.
Your code is performing an insertion sort, which is O(n^2). Iterating over a deque is slower than iterating over a vector.
I suspect the reason you are not seeing the same result as the posted link is because the run-time of your program is dominated by the loop in InsertionIndex, not the call to deque::insert (or vector::insert).
What is the overhead in splitting a for-loop like this,
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
    // some more code
    // even more code
}
into multiple for-loops like this?
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
}
for (i = 0; i < exchanges; i++)
{
    // some more code
}
for (i = 0; i < exchanges; i++)
{
    // even more code
}
The code is performance-sensitive, but doing the latter would improve readability significantly. (In case it matters, there are no other loops, variable declarations, or function calls, save for a few accessors, within each loop.)
I'm not exactly a low-level programming guru, so it'd be even better if someone could measure up the performance hit in comparison to basic operations, e.g. "Each additional for-loop would cost the equivalent of two int allocations." But, I understand (and wouldn't be surprised) if it's not that simple.
Many thanks in advance.
There are often way too many factors at play... And it's easy to demonstrate both ways:
For example, splitting the following loop results in almost a 2x slow-down (full test code at the bottom):
for (int c = 0; c < size; c++){
    data[c] *= 10;
    data[c] += 7;
    data[c] &= 15;
}
And this is almost stating the obvious since you need to loop through 3 times instead of once and you make 3 passes over the entire array instead of 1.
On the other hand, if you take a look at this question: Why are elementwise additions much faster in separate loops than in a combined loop?
for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}
The opposite is sometimes true due to memory alignment.
What to take from this?
Pretty much anything can happen. Neither way is always faster and it depends heavily on what's inside the loops.
And as such, determining whether such an optimization will increase performance is usually trial-and-error. With enough experience you can make fairly confident (educated) guesses. But in general, expect anything.
"Each additional for-loop would cost the equivalent of two int allocations."
You are correct that it's not that simple. In fact it's so complicated that the numbers don't mean much. A loop iteration may take X cycles in one context, but Y cycles in another due to a multitude of factors such as Out-of-order Execution and data dependencies.
Not only is the performance context-dependent, it also varies with different processors.
Here's the test code:
#include <time.h>
#include <iostream>
using namespace std;
int main(){
    int size = 10000;
    int *data = new int[size](); // value-initialized to zero so the loops
                                 // below don't read indeterminate values

    clock_t start = clock();

    for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
        for (int c = 0; c < size; c++){
            data[c] *= 10;
            data[c] += 7;
            data[c] &= 15;
        }
#else
        for (int c = 0; c < size; c++){
            data[c] *= 10;
        }
        for (int c = 0; c < size; c++){
            data[c] += 7;
        }
        for (int c = 0; c < size; c++){
            data[c] &= 15;
        }
#endif
    }

    clock_t end = clock();
    cout << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
}
Output (one loop): 4.08 seconds
Output (3 loops): 7.17 seconds
Processors prefer to have a higher ratio of data instructions to jump instructions.
Branch instructions may force your processor to clear the instruction pipeline and reload.
Based on the reloading of the instruction pipeline, the first method would be faster, but not significantly. You would add at least 2 new branch instructions by splitting.
A faster optimization is to unroll the loop. Unrolling the loop tries to improve the ratio of data instructions to branch instructions by performing more instructions inside the loop before branching to the top of the loop.
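A minimal unrolling sketch, reusing the data[c] *= 10 body from the test code above and assuming size is a multiple of 4:

#include <cstddef>

void scale_unrolled(int* data, std::size_t size) {
    // four data instructions per backward branch instead of one
    for (std::size_t c = 0; c < size; c += 4) {
        data[c]     *= 10;
        data[c + 1] *= 10;
        data[c + 2] *= 10;
        data[c + 3] *= 10;
    }
}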
Another significant performance optimization is to organize the data so it fits into one of the processor's cache lines. So, for example, you could have inner loops that each process a single cache-sized block of data, while the outer loop loads new items into the cache.
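A hedged sketch of that blocking idea, applied to the split loops from the test code above (BLOCK is a hypothetical tuning value; measure on the target CPU):

#include <algorithm>
#include <cstddef>

void process_blocked(int* data, std::size_t size) {
    const std::size_t BLOCK = 4096; // ~16 KB of ints; tune to the cache size
    for (std::size_t base = 0; base < size; base += BLOCK) {
        const std::size_t end = std::min(base + BLOCK, size);
        // each pass re-touches only data that is still cache-resident
        for (std::size_t c = base; c < end; c++) data[c] *= 10;
        for (std::size_t c = base; c < end; c++) data[c] += 7;
        for (std::size_t c = base; c < end; c++) data[c] &= 15;
    }
}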
These optimizations should only be applied after the program runs correctly and robustly and the environment demands more performance, where the environment is defined by observers (animation/movies), users (waiting for a response), or hardware (operations that must finish before a critical time event). Any other purpose is a waste of your time, as the OS (running concurrent programs) and storage access will contribute more to your program's performance issues.
This will give you a good indication of whether or not one version is faster than another.
#include <array>
#include <chrono>
#include <iostream>
#include <numeric>
#include <string>
const int iterations = 100;
namespace
{
    const int exchanges = 200;

    template<typename TTest>
    void Test(const std::string &name, TTest &&test)
    {
        typedef std::chrono::high_resolution_clock Clock;
        typedef std::chrono::duration<float, std::milli> ms;

        std::array<float, iterations> timings;
        for (auto i = 0; i != iterations; ++i)
        {
            auto t0 = Clock::now();
            test();
            timings[i] = ms(Clock::now() - t0).count();
        }
        // note the 0.0f: accumulating into an int literal (0) would
        // truncate the fractional milliseconds away
        auto avg = std::accumulate(timings.begin(), timings.end(), 0.0f) / iterations;
        std::cout << "Average time, " << name << ": " << avg << std::endl;
    }
}

int main()
{
    Test("single loop",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
                // some more code
                // even more code
            }
        });

    Test("separated loops",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // some more code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // even more code
            }
        });
}
The thing is quite simple: the first code is like taking a single lap on a race track, and the other code is like running a full 3-lap race, so three laps take more time than one. However, if the loops do work that must happen in sequence and the passes depend on each other, then the second form is required: for example, if the first loop does some calculations and the second loop works with the results of those calculations, both loops need to run to completion, in sequence.
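A small illustration of that dependent case (hypothetical arrays; the point is that pass 2 reads a value pass 1 produces last, so the two loops cannot be fused):

#include <cstddef>
#include <vector>

std::vector<int> reversed_squares(const std::vector<int>& input) {
    const std::size_t n = input.size();
    std::vector<int> squares(n), out(n);
    for (std::size_t i = 0; i < n; i++)
        squares[i] = input[i] * input[i]; // pass 1: produce all results
    for (std::size_t i = 0; i < n; i++)
        out[i] = squares[n - 1 - i];      // pass 2: needs all of pass 1
    return out;
}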
I have a nested for-loop structure and right now I am re-declaring the vector at the start of each iteration:
void function (n1, n2, bound, etc){
    for (int i = 0; i < bound; i++){
        vector< vector<long long> > vec(n1, vector<long long>(n2));
        //about three more for-loops here
    }
}
This allows me to "start fresh" each iteration, which works great because my internal operations are largely in the form of vec[a][b] += some value. But I worry that it's slow for large n1 or large n2. I don't know the underlying architecture of vectors/arrays/etc so I am not sure what the fastest way is to handle this situation. Should I use an array instead? Should I clear it differently? Should I handle the logic differently altogether?
EDIT: The vector's size technically does not change each iteration (but it may change based on function parameters). I'm simply trying to clear it/etc so the program is as fast as humanly possible given all other circumstances.
EDIT:
My results of different methods:
Timings (for a sample set of data):
redeclaring vector method: 111623 ms
clearing/resizing method: 126451 ms
looping/setting to 0 method: 88686 ms
I have a clear preference for small scopes (i.e. declaring the variable in the innermost loop if it’s only used there) but for large sizes this could cause a lot of allocations.
So if this loop is a performance problem, try declaring the variable outside the loop and merely clearing it inside the loop – however, this is only advantageous if the (reserved) size of the vector stays identical. If you are resizing the vector, then you get reallocations anyway.
Don’t use a raw array – it gives you no advantage, and only trouble.
Here is some code that tests a few different methods.
#include <chrono>
#include <iostream>
#include <vector>
int main()
{
    typedef std::chrono::high_resolution_clock clock;

    unsigned n1 = 1000;
    unsigned n2 = 1000;

    // Original method
    {
        auto start = clock::now();
        for (unsigned i = 0; i < 10000; ++i)
        {
            std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
            // vec is initialized to zero already
            // do stuff
        }
        auto elapsed_time = clock::now() - start;
        std::cout << elapsed_time.count() << std::endl;
    }

    // reinitialize values to zero at every pass in the loop
    {
        auto start = clock::now();
        std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
        for (unsigned i = 0; i < 10000; ++i)
        {
            // initialize vec to zero at the start of every loop
            for (unsigned j = 0; j < n1; ++j)
                for (unsigned k = 0; k < n2; ++k)
                    vec[j][k] = 0;
            // do stuff
        }
        auto elapsed_time = clock::now() - start;
        std::cout << elapsed_time.count() << std::endl;
    }

    // clearing the vector this way is not optimal since it will destruct the
    // inner vectors
    {
        auto start = clock::now();
        std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
        for (unsigned i = 0; i < 10000; ++i)
        {
            vec.clear();
            vec.resize(n1, std::vector<long long>(n2));
            // do stuff
        }
        auto elapsed_time = clock::now() - start;
        std::cout << elapsed_time.count() << std::endl;
    }

    // equivalent to the second method from above
    // no performance penalty
    {
        auto start = clock::now();
        std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
        for (unsigned i = 0; i < 10000; ++i)
        {
            for (unsigned j = 0; j < n1; ++j)
            {
                vec[j].clear();
                vec[j].resize(n2);
            }
            // do stuff
        }
        auto elapsed_time = clock::now() - start;
        std::cout << elapsed_time.count() << std::endl;
    }
}
Edit: I've updated the code to make a fairer comparison between the methods.
Edit 2: Cleaned up the code a bit, methods 2 or 4 are the way to go.
Here are the timings of the above four methods on my computer:
16327389
15216024
16371469
15279471
The point is that you should try out different methods and profile your code.
When choosing a container I usually use this diagram to help me:
[diagram: container-choice flowchart; the linked image from the original answer]
Other than that, as previously posted: if this is causing performance problems, declare the container outside of the for loop and just clear it at the start of each iteration.
In addition to the previous comments: if you use Robinson's swap method, you could go even faster by handling that swap asynchronously.
Why not something like this:
{
    vector< vector<long long> > vec(n1, vector<long long>(n2));
    for (int i = 0; i < bound; i++){
        //about three more for-loops here

        // reset the contents for the next iteration; a bare vec.clear()
        // would leave the outer vector empty, so zero the rows instead
        // (needs <algorithm> for std::fill)
        for (size_t j = 0; j < vec.size(); j++)
            std::fill(vec[j].begin(), vec[j].end(), 0LL);
    }
}
Edit: added scope braces ;-)
Well if you are really concerned about performance (and you know the size of n1 and n2 beforehand) but don't want to use a C-style array, std::array may be your friend.
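For illustration, a minimal sketch assuming (hypothetically) that the dimensions were compile-time constants - see the edit below for why that assumption fails here:

#include <array>
#include <cstddef>

int main() {
    constexpr std::size_t n1 = 64, n2 = 64; // hypothetical fixed sizes
    std::array<std::array<long long, n2>, n1> arr{}; // zero-initialized, no heap allocation
    return arr[0][0] == 0 ? 0 : 1;
}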
EDIT: Given your edit, it seems an std::array isn't an appropriate substitute since while the vector size does not change each iteration, it still isn't known before compilation.
Since you have to reset the vector values to 0 each iteration, in practical terms, this question boils down to "is the cost of allocating and deallocating the memory for the vector cheap or expensive compared to the computations inside the loops".
Assuming the computations are the expensive part of the algorithm, the way you've coded it is both clear, concise, shows the intended scope, and is probably just as fast as alternate approaches.
If however your computations and updates are extremely fast and the allocation/deallocation of the vector is relatively expensive, you could use std::fill to fill zeroes back into the array at the end/beginning of each iteration through the loop.
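A minimal sketch of that std::fill approach, reusing the question's vec shape (no allocation happens per reset):

#include <algorithm>
#include <vector>

void reset_to_zero(std::vector<std::vector<long long>>& vec) {
    for (auto& row : vec)
        std::fill(row.begin(), row.end(), 0LL); // overwrite in place
}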
Of course the only way to know for sure is to measure with a profiler. I suspect you'll find that the approach you took won't show up as a hotspot of any sort and you should leave the obvious code in place.
The overhead of using a vector vs an array is minor, especially when you are getting a lot of useful functionality from the vector. Internally a vector allocates an array. So vector is the way to go.