I notice that vector<bool> is much slower than a bool array when running the following code.
#include <iostream>
#include <vector>
using namespace std;
int main()
{
int count = 0;
int n = 1500000;
// slower with C++ vector<bool>
/*vector<bool> isPrime;
isPrime.reserve(n);
isPrime.assign(n, true);
*/
// faster with bool array
bool* isPrime = new bool[n];
for (int i = 0; i < n; ++i)
isPrime[i] = true;
for (int i = 2; i < n; ++i) {
if (isPrime[i])
count++;
for (int j = 2; i * j < n; ++j)
isPrime[i * j] = false;
}
cout << count << endl;
delete[] isPrime; // release the manually allocated array
return 0;
}
Is there some way to make vector<bool> faster? By the way, both std::vector::push_back and std::vector::emplace_back are even slower than std::vector::assign.
std::vector<bool> can have various performance issues (e.g. take a look at https://isocpp.org/blog/2012/11/on-vectorbool).
In general you can:
use std::vector<std::uint8_t> instead of std::vector<bool> (also give std::valarray<bool> a try).
This requires more memory and is less cache-friendly, but there is no overhead (in the form of bit manipulation) to access a single value, so there are situations in which it works better (after all, it's just like your array of bool but without the nuisance of manual memory management).
use std::bitset if you know at compile time how large your boolean array is going to be (or can at least establish a reasonable upper bound)
if Boost is an option, try boost::dynamic_bitset (its size can be specified at runtime)
But for speed optimizations you always have to measure.
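As a sketch of the first suggestion, assuming you keep the question's sieve exactly as it is, the flags can live in a std::vector<std::uint8_t>. The function name count_primes_u8 is hypothetical; whether it actually beats std::vector<bool> on your compiler and standard library is something you have to measure.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: the question's sieve, but with one byte per flag
// (std::vector<std::uint8_t>) instead of one bit (std::vector<bool>).
// Each access is a plain byte load/store, with no bit masking.
int count_primes_u8(int n)
{
    if (n < 2)
        return 0;
    std::vector<std::uint8_t> isPrime(static_cast<std::size_t>(n), 1);  // 1 = not crossed out
    int count = 0;
    for (int i = 2; i < n; ++i) {
        if (isPrime[i])
            ++count;
        for (int j = 2; i * j < n; ++j)  // same marking loop as the question
            isPrime[i * j] = 0;
    }
    return count;
}
```

You get memory management for free, just as with vector<bool>, but each flag costs a byte instead of a bit.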
With your specific example I can confirm a performance difference only when optimizations are turned off (of course this isn't the way to go).
Some tests with g++ v4.8.3 and clang++ v3.4.5 on an Intel Xeon system (-O3 optimization level) give a different picture:
                time (ms)
                g++     clang++
array of bool   3103    3010
vector<bool>    2835    2420    // not bad!
vector<char>    3136    3031    // same as array of bool
bitset          2742    2388    // marginally better
(time elapsed for 100 runs of the code in the answer)
std::vector<bool> doesn't look so bad (source code here).
vector<bool> is a template specialization that is allowed to store its elements as a packed bit array to save space. Extracting and storing individual bits and converting them to and from bool may cause the performance drop you are observing. If you use std::vector::push_back, the vector keeps growing and reallocating, which makes performance even worse. Another cost to keep in mind is assign (complexity: linear in its first argument), whereas each access through operator[] is constant time.
On the other hand, a bool[] is guaranteed to be a plain array of bool.
And you should resize to n instead of n-1 to avoid undefined behaviour.
vector<bool> can be high performance, but isn't required to be. For vector<bool> to be efficient, it needs to operate on many bools at a time (e.g. isPrime.assign(n, true)), and the implementor has had to put loving care into it. Indexing individual bools in a vector<bool> is slow.
Here is a prime finder that I wrote a while back using vector<bool> and clang + libc++ (the libc++ part is important):
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
std::vector<bool>
init_primes()
{
std::vector<bool> primes(0x80000000, true);
primes[0] = false;
primes[1] = false;
const auto pb = primes.begin();
const auto pe = primes.end();
const auto sz = primes.size();
size_t i = 2;
while (true)
{
size_t j = i*i;
if (j >= sz)
break;
do
{
primes[j] = false;
j += i;
} while (j < sz);
i = std::find(pb + (i+1), pe, true) - pb;
}
return primes;
}
int
main()
{
using namespace std::chrono;
using dsec = duration<double>;
auto t0 = steady_clock::now();
auto p = init_primes();
auto t1 = steady_clock::now();
std::cout << dsec(t1-t0).count() << "\n";
}
This executes for me in about 28s (-O3). When I change it to return a vector<char> instead, the execution time goes up to about 44s.
If you run this using some other std::lib, you probably won't see this trend. In libc++, algorithms such as std::find have been optimized to search a word of bits at a time instead of a bit at a time.
See http://howardhinnant.github.io/onvectorbool.html for more details on what std algorithms could be optimized by your vendor.
Related
I read this nice experiment comparing, in particular, the performance of calling insert() on both a vector and a deque container. The result from that particular experiment (Experiment 4) was that deque is vastly superior for this operation.
I implemented my own test using a short sorting function I wrote, which I should note uses the [] operator along with other member functions, and found vastly different results. For example, for inserting 100,000 elements, vector took 24.88 seconds, while deque took 374.35 seconds.
How can I explain this? I imagine it has something to do with my sorting function, but would like the details!
I'm using g++ 4.6 with no optimizations.
Here's the program:
#include <iostream>
#include <vector>
#include <deque>
#include <cstdlib>
#include <ctime>
using namespace std;
size_t InsertionIndex(vector<double>& vec, double toInsert) {
for (size_t i = 0; i < vec.size(); ++i)
if (toInsert < vec[i])
return i;
return vec.size(); // return last index+1 if toInsert is largest yet
}
size_t InsertionIndex(deque<double>& deq, double toInsert) {
for (size_t i = 0; i < deq.size(); ++i)
if (toInsert < deq[i])
return i;
return deq.size(); // return last index+1 if toInsert is largest yet
}
int main() {
vector<double> vec;
deque<double> deq;
size_t N = 100000;
clock_t tic = clock();
for(int i = 0; i < N; ++i) {
double val = rand();
vec.insert(vec.begin() + InsertionIndex(vec, val), val);
// deq.insert(deq.begin() + InsertionIndex(deq, val), val);
}
float total = (float)(clock() - tic) / CLOCKS_PER_SEC;
cout << total << endl;
}
The special case where deque can be much faster than vector is when you're inserting at the front of the container. In your case you're inserting at random locations, which will actually give the advantage to vector.
Also unless you're using an optimized build, it's quite possible that there are bounds checks in the library implementation. Those checks can add significantly to the time. To do a proper benchmark comparison you must run with all normal optimizations turned on and debug turned off.
Your code is performing an insertion sort, which is O(n^2). Iterating over a deque is slower than iterating over a vector.
I suspect the reason you are not seeing the same result as the posted link is that the run time of your program is dominated by the loop in InsertionIndex, not by the call to deque::insert (or vector::insert).
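One way to test that explanation, sketched here as an assumption rather than something from the answers above, is to replace the linear scan in InsertionIndex with a binary search (std::upper_bound finds the same position, the first element greater than the new value). The helper name sorted_insert is hypothetical. With the search cheap, the cost of the insert itself, which still shifts elements, becomes the part that differs between vector and deque.

```cpp
#include <algorithm>
#include <deque>   // sorted_insert works with std::deque ...
#include <vector>  // ... and with std::vector

// Hypothetical helper: insert `value` into an already-sorted container.
// std::upper_bound returns the first element greater than `value`, i.e. the
// same index the question's InsertionIndex computes, but in O(log n) steps.
template <typename Container>
void sorted_insert(Container& c, typename Container::value_type value)
{
    auto pos = std::upper_bound(c.begin(), c.end(), value);
    c.insert(pos, value);  // the element shifting is still O(n)
}
```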
I have two arrays. The first is x times the size of the second.
I need to copy only every x-th element of the first (bigger) array into the second (smaller) array.
That is, elements 0, x, 2x, ...
Each array is a contiguous block of memory.
The arrays hold simple values.
I am currently doing it using a loop.
Is there any faster smarter way to do this?
Maybe with ostream?
Thanks!
You are doing something like this right?
#include <cstddef>
int main()
{
const std::size_t N = 20;
const std::size_t x = 5;
int input[N*x] = {}; // zero-initialized so the reads below are well-defined
int output[N];
for(std::size_t i = 0; i < N; ++i)
output[i] = input[i*x];
}
Well, I don't know any function that does that, so I would use the for loop. This is fast.
EDIT: an even faster solution (avoids the multiplications; C++03 version):
int* inputit = input;
int* outputit = output;
int* outputend = output+N;
while(outputit != outputend)
{
*outputit = *inputit;
++outputit;
inputit+=x;
}
If I understand you correctly, you want to copy every n-th element. The simplest solution would be:
#include <iostream>
int main(int argc, char **argv) {
const int size[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int out[5];
int *pout = out;
for (const int *i = &size[0]; i < &size[10]; i += 3) {
std::cout << *i << ", ";
*pout++ = *i;
if (pout > &out[4]) {
break;
}
}
std::cout << "\n";
for (const int *i = out; i < pout; i++) {
std::cout << *i << ", ";
}
std::cout << std::endl;
}
You can use std::copy_if and a lambda in C++11:
copy_if(a.begin(), a.end(), b.begin(), [&] (const int& i) -> bool
{ size_t index = &i - &a[0]; return index % x == 0; });
A test case would be:
#include <iostream>
#include <vector>
#include <algorithm> // std::copy_if
using namespace std;
int main()
{
std::vector<int> a;
a.push_back(0);
a.push_back(1);
a.push_back(2);
a.push_back(3);
a.push_back(4);
std::vector<int> b(3);
int x = 2;
std::copy_if(a.begin(), a.end(), b.begin(), [&] (const int& i) -> bool
{ size_t index = &i - &a[0]; return index % x == 0; });
for(int i=0; i<b.size(); i++)
{
std::cout<<" "<<b[i];
}
return 0;
}
Note that you need to use a C++11 compatible compiler (if gcc, with -std=c++11 option).
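#include <array>
#include <iterator>
// Copies every x-th element of [first, last) to result.
// Assumes std::distance(first, last) is a multiple of x; otherwise `it`
// would be advanced past `last`, which is undefined behaviour.
template<typename InIt, typename OutIt>
void copy_step_x(InIt first, InIt last, OutIt result, int x)
{
for(auto it = first; it != last; std::advance(it, x))
*result++ = *it;
}
int main()
{
std::array<int, 64> ar0{}; // value-initialized (all zeros)
std::array<int, 32> ar1{};
copy_step_x(std::begin(ar0), std::end(ar0), std::begin(ar1), ar0.size() / ar1.size());
}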
The proper and clean way of doing this is a loop, as others have said; a number of good answers here show you how to do that.
I do NOT recommend doing it the following way. Whether it works depends on a lot of specifics (the value of X, the size and value range of the elements, and so on), but for some cases you could do it like this:
for every 4 bytes:
tmp = copy a 32 bit variable from the array, this now contains the 4 new values
real_tmp = bitmask tmp to get the right variable of those 4
add it to the list
This only works if your values are <= 255 (one byte each) and X == 4, but if you want something faster than a plain loop, this is one way of doing it. It could be adapted to 16-bit, 32-bit or 64-bit loads and to every 2nd through 8th value (8 with 64-bit loads), but it will not work for X > 8, for values that are not stored contiguously, or for classes.
For this kind of optimization to be worth the hassle, the code needs to run often; I assume you've run a profiler to confirm that the current copy is a bottleneck before implementing something like this.
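A minimal sketch of the 4-byte idea for the X == 4, byte-sized-values case, assuming the input holds exactly 4*n bytes. The function name is hypothetical, memcpy is used for the 32-bit load to avoid aliasing and alignment problems, and a runtime probe picks which bits of the loaded word hold the first byte so it also works on big-endian machines. It still loops, just with one word load per output byte; whether that beats a plain byte loop (which compilers often vectorize) is something to measure.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies in[0], in[4], in[8], ... into out[0..n-1].
// `in` must hold at least 4*n bytes.
void copy_every_4th_byte(const std::uint8_t* in, std::uint8_t* out, std::size_t n)
{
    // Which bits of a loaded 32-bit word hold the first byte in memory:
    // the low 8 bits on little-endian machines, the high 8 bits on big-endian.
    const std::uint32_t probe = 1;
    std::uint8_t first_byte;
    std::memcpy(&first_byte, &probe, 1);
    const unsigned shift = (first_byte == 1) ? 0u : 24u;

    for (std::size_t k = 0; k < n; ++k) {
        std::uint32_t word;
        std::memcpy(&word, in + 4 * k, sizeof word);                   // one 32-bit load per group of 4
        out[k] = static_cast<std::uint8_t>((word >> shift) & 0xFFu);  // mask out the wanted byte
    }
}
```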
The following is an observation on how most CPU designs are unimaginative when it comes to this sort of thing.
On some OpenVPX boards you have the ability to DMA data from one processor to another. The one that I use has a pretty advanced DMA controller, and it can do this sort of thing for you.
For example, I could ask it to copy your big array to another CPU, but skipping over N elements of the array, just like you're trying to do. As if by magic the destination CPU would have the smaller array in its memory. I could also if I wanted perform matrix transformations, etc.
The nice thing is that it takes no CPU time at all to do this; it's all done by the DMA engine. My CPUs can then concentrate on harder sums instead of being tied down shuffling data around.
I think the Cell processor in the PS3 can do this sort of thing internally (I know it can DMA data around; I don't know if it will do the strip mining at the same time). Some DSP chips can do it too. But x86 doesn't do it, meaning we software programmers have to write ridiculous loops just to move data in simple patterns. Yawn.
I have written a multithreaded memcpy() in the past to do this sort of thing. The only way you're going to beat a for loop is to have several threads doing your for loop in several parallel chunks.
If you pick the right compiler (e.g. Intel's ICC or Sun/Oracle's Sun Studio), it can automatically parallelise your for loops on your behalf (so your source code doesn't change). That's probably the simplest way to beat your original for loop.
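The "several threads doing your for loop in parallel chunks" idea can be sketched with std::thread. This is a hypothetical illustration, not the answerer's multithreaded memcpy: each thread handles a disjoint slice of the output, so no synchronization is needed beyond the final join. Since a strided copy is usually limited by memory bandwidth, more threads may not help much; with GCC you may also need to link with -pthread.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: out[i] = in[i * x] for i in [0, n), with the output
// split into `threads` contiguous chunks, one std::thread per chunk.
void strided_copy_parallel(const int* in, int* out, std::size_t n,
                           std::size_t x, unsigned threads)
{
    if (threads == 0)
        threads = 1;
    const std::size_t chunk = (n + threads - 1) / threads;  // ceiling division
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(n, begin + chunk);
        if (begin >= end)
            break;
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = in[i * x];  // each thread writes a disjoint range of out
        });
    }
    for (auto& w : workers)
        w.join();
}
```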
I have a nested for-loop structure and right now I am re-declaring the vector at the start of each iteration:
void function (n1,n2,bound,etc){
for (int i=0; i<bound; i++){
vector< vector<long long> > vec(n1, vector<long long>(n2));
//about three more for-loops here
}
}
This allows me to "start fresh" each iteration, which works great because my internal operations are largely in the form of vec[a][b] += some value. But I worry that it's slow for large n1 or large n2. I don't know the underlying architecture of vectors/arrays/etc so I am not sure what the fastest way is to handle this situation. Should I use an array instead? Should I clear it differently? Should I handle the logic differently altogether?
EDIT: The vector's size technically does not change each iteration (but it may change based on function parameters). I'm simply trying to clear it/etc so the program is as fast as humanly possible given all other circumstances.
EDIT:
My results of different methods:
Timings (for a sample set of data):
redeclaring vector method: 111623 ms
clearing/resizing method: 126451 ms
looping/setting to 0 method: 88686 ms
I have a clear preference for small scopes (i.e. declaring the variable in the innermost loop if it’s only used there) but for large sizes this could cause a lot of allocations.
So if this loop is a performance problem, try declaring the variable outside the loop and merely clearing it inside the loop – however, this is only advantageous if the (reserved) size of the vector stays identical. If you are resizing the vector, then you get reallocations anyway.
Don’t use a raw array – it doesn’t give you any advantage, and only trouble.
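A minimal sketch of "declare outside, reset inside", keeping the question's nested vectors and assuming n1 and n2 are at least 1 and do not change during the loop. The function name and the inner `+=` are hypothetical stand-ins for the real work. assign() refills the grid with zeros; because the size stays the same, the outer vector keeps its storage, and typical implementations copy each zero row into the existing inner storage instead of reallocating it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: the grid is declared once, outside the loop, and reset
// with assign() at the start of every iteration. Assumes n1 >= 1 and n2 >= 1.
long long accumulate_runs(std::size_t n1, std::size_t n2, int bound)
{
    std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
    long long total = 0;
    for (int i = 0; i < bound; ++i) {
        vec.assign(n1, std::vector<long long>(n2));  // "start fresh": all zeros again
        for (std::size_t a = 0; a < n1; ++a)
            for (std::size_t b = 0; b < n2; ++b)
                vec[a][b] += static_cast<long long>(a + b);  // stand-in for the real work
        total += vec[n1 - 1][n2 - 1];
    }
    return total;
}
```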
Here is some code that tests a few different methods.
#include <chrono>
#include <iostream>
#include <vector>
int main()
{
typedef std::chrono::high_resolution_clock clock;
unsigned n1 = 1000;
unsigned n2 = 1000;
// Original method
{
auto start = clock::now();
for (unsigned i = 0; i < 10000; ++i)
{
std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
// vec is initialized to zero already
// do stuff
}
auto elapsed_time = clock::now() - start;
std::cout << elapsed_time.count() << std::endl;
}
// reinitialize values to zero at every pass in the loop
{
auto start = clock::now();
std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
for (unsigned i = 0; i < 10000; ++i)
{
// initialize vec to zero at the start of every loop
for (unsigned j = 0; j < n1; ++j)
for (unsigned k = 0; k < n2; ++k)
vec[j][k] = 0;
// do stuff
}
auto elapsed_time = clock::now() - start;
std::cout << elapsed_time.count() << std::endl;
}
// clearing the vector this way is not optimal since it will destruct the
// inner vectors
{
auto start = clock::now();
std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
for (unsigned i = 0; i < 10000; ++i)
{
vec.clear();
vec.resize(n1, std::vector<long long>(n2));
// do stuff
}
auto elapsed_time = clock::now() - start;
std::cout << elapsed_time.count() << std::endl;
}
// equivalent to the second method from above
// no performance penalty
{
auto start = clock::now();
std::vector<std::vector<long long>> vec(n1, std::vector<long long>(n2));
for (unsigned i = 0; i < 10000; ++i)
{
for (unsigned j = 0; j < n1; ++j)
{
vec[j].clear();
vec[j].resize(n2);
}
// do stuff
}
auto elapsed_time = clock::now() - start;
std::cout << elapsed_time.count() << std::endl;
}
}
Edit: I've updated the code to make a fairer comparison between the methods.
Edit 2: Cleaned up the code a bit, methods 2 or 4 are the way to go.
Here are the timings of the above four methods on my computer:
16327389
15216024
16371469
15279471
The point is that you should try out different methods and profile your code.
When choosing a container I usually use this diagram to help me:
Other than that, as previously posted: if this is causing performance problems, declare the container outside of the for loop and just clear it at the start of each iteration.
In addition to the previous comments:
if you use Robinson's swap method, you could go even faster by handling that swap asynchronously.
Why not something like this:
{
vector< vector<long long> > vec(n1, vector<long long>(n2));
for (int i=0; i<bound; i++){
//about three more for-loops here
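vec.assign(n1, vector<long long>(n2)); // clear() alone would leave vec empty for the next iteration, so refill it with zeros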
}
}
Edit: added scope braces ;-)
Well if you are really concerned about performance (and you know the size of n1 and n2 beforehand) but don't want to use a C-style array, std::array may be your friend.
EDIT: Given your edit, it seems an std::array isn't an appropriate substitute since while the vector size does not change each iteration, it still isn't known before compilation.
Since you have to reset the vector values to 0 each iteration, in practical terms, this question boils down to "is the cost of allocating and deallocating the memory for the vector cheap or expensive compared to the computations inside the loops".
Assuming the computations are the expensive part of the algorithm, the way you've coded it is clear and concise, shows the intended scope, and is probably just as fast as the alternatives.
If, however, your computations and updates are extremely fast and the allocation/deallocation of the vector is relatively expensive, you could use std::fill to write zeroes back into the vector at the end/beginning of each iteration through the loop.
Of course the only way to know for sure is to measure with a profiler. I suspect you'll find that the approach you took won't show up as a hotspot of any sort and you should leave the obvious code in place.
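The std::fill alternative mentioned above might look like this for the question's nested vectors. The helper name zero_fill is hypothetical. std::fill only assigns to existing elements, so neither the outer vector nor any row is reallocated.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: reset every element to zero in place with std::fill,
// keeping all existing allocations (outer vector and every row) untouched.
void zero_fill(std::vector<std::vector<long long>>& vec)
{
    for (auto& row : vec)
        std::fill(row.begin(), row.end(), 0LL);
}
```

Called at the top of each loop iteration, it gives the same "start fresh" behaviour as re-declaring the vector, without the allocations.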
The overhead of using a vector vs an array is minor, especially when you are getting a lot of useful functionality from the vector. Internally a vector allocates an array. So vector is the way to go.
In the following example, a std::map and a std::vector are each filled with 26 entries: keys 'a' to 'z' and values 0 to 25. The time taken (on my system) to look up the last entry 10,000,000 times is roughly 250 ms for the vector and 125 ms for the map. (I compiled in release mode with -O3 using g++ 4.4.)
But if for some odd reason I wanted better performance than the std::map, what data structures and functions would I need to consider using?
I apologize if the answer seems obvious to you, but I haven't had much experience in the performance critical aspects of C++ programming.
#include <ctime>
#include <map>
#include <vector>
#include <iostream>
struct mystruct
{
char key;
int value;
mystruct(char k = 0, int v = 0) : key(k), value(v) { }
};
int find(const std::vector<mystruct>& ref, char key)
{
for (std::vector<mystruct>::const_iterator i = ref.begin(); i != ref.end(); ++i)
if (i->key == key) return i->value;
return -1;
}
int main()
{
std::map<char, int> mymap;
std::vector<mystruct> myvec;
for (int i = 'a'; i < 'a' + 26; ++i)
{
mymap[i] = i - 'a';
myvec.push_back(mystruct(i, i - 'a'));
}
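std::clock_t pre = std::clock();
for (int i = 0; i < 10000000; ++i)
{
find(myvec, 'z');
}
// clock() returns processor ticks; convert to milliseconds via CLOCKS_PER_SEC
std::cout << "linear scan: milli " << (std::clock() - pre) * 1000.0 / CLOCKS_PER_SEC << "\n";
pre = std::clock();
for (int i = 0; i < 10000000; ++i)
{
mymap['z'];
}
std::cout << "map scan: milli " << (std::clock() - pre) * 1000.0 / CLOCKS_PER_SEC << "\n";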
return 0;
}
For your example, use int value(char x) { return x - 'a'; }
More generalized, since the "keys" are continuous and dense, use an array (or vector) to guarantee Θ(1) access time.
If you don't need the keys to be sorted, use unordered_map, which improves most operations from O(log n) to average-case O(1).
(Sometimes, esp. for small data sets, linear search is faster than hash table (unordered_map) / balanced binary trees (map) because the former has a much simpler algorithm, thus reducing the hidden constant in big-O. Profile, profile, profile.)
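A sketch of the unordered_map suggestion, building the same 'a'..'z' to 0..25 table as the question. Both function names are hypothetical, and returning -1 for a missing key simply mirrors the question's find(). As the note above says, for 26 keys the linear scan may still win, so profile.

```cpp
#include <unordered_map>

// Hypothetical helpers: same 'a'..'z' -> 0..25 table as the question,
// stored in a hash table instead of a std::map or a linear vector.
std::unordered_map<char, int> make_table()
{
    std::unordered_map<char, int> table;
    for (char c = 'a'; c <= 'z'; ++c)
        table[c] = c - 'a';
    return table;
}

int lookup(const std::unordered_map<char, int>& table, char key)
{
    auto it = table.find(key);                    // average O(1)
    return it == table.end() ? -1 : it->second;  // -1 mirrors the question's find()
}
```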
For starters, you should probably use std::map::find if you want to compare search times; operator[] does more than a plain lookup (it inserts a default-constructed value when the key is missing), so it is not equivalent to find.
Also, your data set is pretty small, which means that the whole vector will easily fit into the processor cache; a lot of modern processors are optimised for this sort of brute-force search so you'd end up getting fairly good performance. The map, while theoretically having better performance (O(log n) rather than O(n)) can't really exploit its advantage of the smaller number of comparisons because there aren't that many keys to compare against and the overhead of its data layout works against it.
TBH for data structures this small, the additional performance gain from not using a vector is often negligible. The "smarter" data structures like std::map come into play when you're dealing with larger amounts of data and a well distributed set of data that you are searching for.
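To illustrate the point about std::map::find versus operator[], here is a hypothetical lookup helper shaped like the question's vector find(). find() works on a const map and never modifies it, whereas operator[] silently adds an entry for a missing key.

```cpp
#include <map>

// Hypothetical helper: std::map lookup through find(). Unlike operator[],
// find() never inserts a default-constructed value for a missing key.
int map_lookup(const std::map<char, int>& m, char key)
{
    auto it = m.find(key);                    // O(log n), no side effects
    return it == m.end() ? -1 : it->second;  // -1 mirrors the question's find()
}
```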
If you really just have values for all entries from a to z, why don't you use the letter (offset by 'a') as the index into a vector?
std::vector<int> direct_map;
direct_map.resize(26);
for (int i = 'a'; i < 'a' + 26; ++i)
{
direct_map[i - 'a']= i - 'a';
}
// ...
int find(const std::vector<int> &direct_map, char key)
{
int index= key - 'a';
if (index>=0 && index<direct_map.size())
return direct_map[index];
return -1;
}
I'm trying to optimize my C++ code. I've searched the internet on using dynamically allocated C++ arrays vs using std::vector and have generally seen a recommendation in favor of std::vector and that the difference in performance between the two is negligible. For instance here - Using arrays or std::vectors in C++, what's the performance gap?.
However, I wrote some code to test the performance of iterating through an array/vector and assigning values to the elements and I generally found that using dynamically allocated arrays was nearly 3 times faster than using vectors (I did specify a size for the vectors beforehand). I used g++-4.3.2.
However I feel that my test may have ignored issues I don't know about so I would appreciate any advice on this issue.
Thanks
Code used -
#include <time.h>
#include <iostream>
#include <vector>
using namespace std;
int main() {
clock_t start,end;
std::vector<int> vec(9999999);
std::vector<int>::iterator vecIt = vec.begin();
std::vector<int>::iterator vecEnd = vec.end();
start = clock();
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = i;
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i;
}
end = clock();
cout<<"array: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
}
When benchmarking C++ containers, it's important to enable most compiler optimisations. Several of my own answers on SO have fallen foul of this; for example, the function call overhead when something like operator[] is not inlined can be very significant.
Just for fun, try iterating over the plain array using a pointer instead of an integer index (the code should look just like the vector iteration, since the point of STL iterators is to appear like pointer arithmetic for most operations). I bet the speed will be exactly equal in that case. Which of course means you should pick the vector, since it will save you a world of headaches from managing arrays by hand.
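A minimal sketch of that experiment, assuming the same fill-with-0..n-1 workload as the question: the plain array is walked with a pointer from begin to end, mirroring the vector-iterator loop. The function name is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: fill a plain array by walking a pointer from begin to
// end, the same shape as the question's std::vector iterator loop.
void fill_with_pointer(int* arr, std::size_t n)
{
    int i = 0;
    for (int* p = arr, *end = arr + n; p != end; ++p)
        *p = i++;
}
```

Timing this against the vector loop, with optimizations on, compares like with like.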
The thing about the standard library classes such as std::vector is that yes, naively, it is a lot more code than a raw array. But all of it can be trivially inlined by the compiler, which means that if optimizations are enabled, it becomes essentially the same code as if you'd used a raw array. The speed difference then is not negligible but non-existent. All the overhead is removed at compile-time.
But that requires compiler optimizations to be enabled.
I imagine the reason why you found iterating and adding to std::vector 3 times slower than a plain array is a combination of the cost of iterating the vector and doing the assignment.
Edit:
That was my initial assumption before the testcase; however running the testcase (compiled with -O3) shows the converse - std::vector is actually 3 times faster, which surprised me.
I can't see how std::vector could be faster (certainly not 3 times faster) than a vanilla array copy - I think there's some optimisation being applied to the std::vector compiled code which isn't happening for the array version.
Original benchmark results:
$ ./array
array: 0.059375
vector: 0.021209
std::vector is 3x faster. Same benchmark again, except with an additional outer loop that runs the timed iterator loop 1000 times:
$ ./array
array: 21.7129
vector: 21.6413
std::vector is now ~ the same speed as array.
Edit 2
Found it! So the problem with your test case is that in the vector case the memory holding the data appears to be already in the CPU cache - either by the way it is initialised, or due to the call to vec.end(). If I 'warm' up the CPU cache before each timing test, I get the same numbers for array and vector:
#include <time.h>
#include <iostream>
#include <vector>
int main() {
clock_t start,end;
std::vector<int> vec(9999999);
std::vector<int>::iterator vecIt = vec.begin();
std::vector<int>::iterator vecEnd = vec.end();
// get vec into CPU cache.
for (int i = 0; vecIt != vecEnd; i++) { *(vecIt++) = i; }
vecIt = vec.begin();
start = clock();
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = i;
}
end = clock();
std::cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
int* arr = new int[9999999];
// get arr into CPU cache.
for (int i = 0; i < 9999999; i++) { arr[i] = i; }
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i;
}
end = clock();
std::cout<<"array: "<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
}
This gives me the following result:
$ ./array
vector: 0.020875
array: 0.020695
I agree with rmeador,
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = i; // <-- quick offset calculation
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i; // <-- not fair play :) - offset = arr + i*sizeof(int)
}
I think the answer here is obvious: it doesn't matter. Like jalf said the code will end up being about the same, but even if it wasn't, look at the numbers. The code you posted creates a huge array of 10 MILLION items, yet iterating over the entire array takes only a few hundredths of a second.
Even if your application really is working with that much data, whatever it is you're actually doing with that data is likely to take much more time than iterating over your array. Just use whichever data structure you prefer, and focus your time on the rest of your code.
To prove my point, here's the code with one change: the assignment of i to the array item is replaced with an assignment of sqrt(i). On my machine using -O2, the execution time triples from .02 to .06 seconds.
#include <time.h>
#include <iostream>
#include <vector>
#include <math.h>
using namespace std;
int main() {
clock_t start,end;
std::vector<int> vec(9999999);
std::vector<int>::iterator vecIt = vec.begin();
std::vector<int>::iterator vecEnd = vec.end();
start = clock();
for (int i = 0; vecIt != vecEnd; i++) {
*(vecIt++) = sqrt(i);
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
int* arr = new int[9999999];
start = clock();
for (int i = 0; i < 9999999; i++) {
arr[i] = i;
}
end = clock();
cout<<"array: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
}
The issue seems to be that you compiled your code with optimizations turned off. On my machine, OS X 10.5.7 with g++ 4.0.1 I actually see that the vector is faster than primitive arrays by a factor of 2.5.
With gcc try to pass -O2 to the compiler and see if there's any improvement.
The reason your array iteration is faster is that the number of iterations is a compile-time constant, so the compiler is able to unroll the loop. Try using rand to generate a number and multiplying it up to the large size you want, so that the compiler can't figure it out at compile time. Then try again; you will see similar run times.
One reason your code might not perform quite the same is that in your std::vector version you are incrementing two values, the integer i and the std::vector::iterator vecIt. To be truly equivalent, you could refactor it to:
start = clock();
for (int i = 0; i < vec.size(); i++) {
vec[i] = i;
}
end = clock();
cout<<"vector: "<<(double)(end-start)/CLOCKS_PER_SEC<<endl;
Your code provides an unfair comparison between the two cases since you're doing far more work in the vector test than the array test.
With the vector, you're incrementing both the iterator (vecIT) and a separate variable (i) for generating the assignment values.
With the array, you're only incrementing the variable i and using it for dual purpose.