QVector<int> writing/resize performance - c++

I was trying to refactor some code that use C-style vector of integers to QVector (as the rest of code use Qt).
Before I do that, I made performance tests to check how bad this change would be.
I uses this code:
#include <QVector>
#include <vector>
#include <cstdio>
#include <ctime>
void test1(int MAX_ELEMENTS, int TIMES) {
int vec[MAX_ELEMENTS];
int nelems = 0;
for (int j=0; j<TIMES; j++) {
nelems = MAX_ELEMENTS;
for (int i=0; i<MAX_ELEMENTS; i++)
vec[i] = 2;
}
printf("Vec[0] = %d\n", vec[0]);
}
void test2(int MAX_ELEMENTS, int TIMES) {
std::vector<int> vec;
vec.reserve(MAX_ELEMENTS);
for (int j=0; j<TIMES; j++) {
vec.clear();
for (int i=0; i<MAX_ELEMENTS; i++)
vec.push_back(2);
}
printf("Vec[0] = %d\n", vec[0]);
}
void test3(int MAX_ELEMENTS, int TIMES) {
QVector<int> vec;
vec.reserve(MAX_ELEMENTS);
for (int j=0; j<TIMES; j++) {
vec.clear();
for (int i=0; i<MAX_ELEMENTS; i++)
vec.push_back(2);
}
printf("Vec[0] = %d\n", vec[0]);
}
void test4(int MAX_ELEMENTS, int TIMES) {
QVector<int> vec;
vec.reserve(MAX_ELEMENTS);
for (int j=0; j<TIMES; j++) {
vec.resize(MAX_ELEMENTS);
for (int i=0; i<MAX_ELEMENTS; i++)
vec[i] = 2;
}
printf("Vec[0] = %d\n", vec[0]);
}
double measureExecutionTime(void (*func)(int, int)) {
const int MAX_ELEMENTS=30000;
const int TIMES=2000000;
clock_t begin, end;
begin = clock();
(*func)(MAX_ELEMENTS, TIMES);
end = clock();
return (double)(end - begin) / CLOCKS_PER_SEC;
}
int main() {
double time_spent;
time_spent = measureExecutionTime(test1);
printf("Test 1 (plain c): %lf\n", time_spent);
time_spent = measureExecutionTime(test2);
printf("Test 2 (std::vector): %lf\n", time_spent);
time_spent = measureExecutionTime(test3);
printf("Test 3 (QVector clear): %lf\n", time_spent);
time_spent = measureExecutionTime(test4);
printf("Test 4 (QVector resize): %lf\n", time_spent);
return 0;
}
And the results were:
Vec[0] = 2
Test 1 (plain c): 16.130129
Vec[0] = 2
Test 2 (std::vector): 92.719583
Vec[0] = 2
Test 3 (QVector clear): 109.882463
Vec[0] = 2
Test 4 (QVector resize): 46.261172
Any ideas on a a different way in order to increase the QVector performance?
This vector is filled from 0 to its new size few times per second (it is used in a timetable scheduling software).
Qt version: 5.7.1(+dsfg1, from Debian testing).
The command line I used to compile from a Linux shell:
g++ -c -m64 -pipe -O2 -Wall -W -D_REENTRANT -fPIC -DQT_NO_DEBUG -DQT_GUI_LIB -DQT_CORE_LIB -I. -I. -isystem /usr/include/x86_64-linux-gnu/qt5 -isystem /usr/include/x86_64-linux-gnu/qt5/QtGui -isystem /usr/include/x86_64-linux-gnu/qt5/QtCore -I. -I/usr/lib/x86_64-linux-gnu/qt5/mkspecs/linux-g++-64 -o teste.o teste.cpp
And to be clear: the vector elements are not equal, I just put them equals 2. The number of valid elements in vector keeps changing as the number of activities of timetable are successfully scheduled - When a given activity cannot be placed in the remaining slots, the algorithm starts rolling back, removing some of the last placed activities in order to start another try of scheduling.

Actually, as MrEricSir pointed out in the comments section, the clear() operation is the real culprit in test4.
If you check the Qt documentation you find the following:
void QVector::clear()
Removes all the elements from the vector.
Note: Until Qt 5.6, this also released the memory used by the vector.
From Qt 5.7, the capacity is preserved.
You may be using Qt version < 5.7 which forces the release of memory within the loop.

Your test is not comparing apples to apples.
int vec[MAX_ELEMENTS] will be allocated on the stack, while std::vector and QVector will use the heap. Stack-based allocation is much faster. If you can get away with stack-based allocation consider std::array.
if you just interested in filling the vector with values, try to benchmark only this part.
In other words, try to benchmark:
for(int i = 0; i < max_size; ++i)
c_array[i] = i;
vs
std_vector.resize(max_size); // <- initialization, not benchmarked
for(int i = 0; i < max_size; ++i)
std_vector[i] = i;
vs
qvector.resize(max_size); // <- initialization, not benchmarked
for(int i = 0; i < max_size; ++i)
qvector[i] = i;
The performance should be fairly similar.

The test is pretty much nonsense to begin with, and not just nonsense, but poorly implemented at that. Especially in terms of testing resizing, with a static size raw C array. How does that test resizing without any resizing taking place? And even for the containers that use any resizing, you are resizing to the already reserved value, which shouldn't have any effect, since your MAX_ELEMENTS never varies throughout the test.
That being said, for such trivial operations QVector is more than likely to suffer from being a COW. That is Qt's implicit sharing for containers, with Copy-On-Write, meaning that each non-const method executed will do a check if the container data is being shared in order to detach it if that's the case. That check involves atomics, which involve synchronization, which is expensive.
If you were to execute more adequate test for write access, such as this:
int elements = 30000, times = 200000;
QElapsedTimer t;
int * ia = new int[elements];
t.start();
for (int it = 0; it < times; ++ it) {
for (int ic = 0; ic < elements; ++ic) {
ia[ic] = 2;
}
}
qDebug() << t.elapsed() << " msec for raw array";
QVector<int> iqv(elements);
t.restart();
for (int it = 0; it < times; ++ it) {
for (int ic = 0; ic < elements; ++ic) {
iqv[ic] = 2;
}
}
qDebug() << t.elapsed() << " msec for qvector";
std::vector<int> isv;
isv.reserve(elements);
t.restart();
for (int it = 0; it < times; ++ it) {
for (int ic = 0; ic < elements; ++ic) {
isv[ic] = 2;
}
}
qDebug() << t.elapsed() << " msec for std::vector";
you'd see results that very much resonate with what's been said above, on my system the results are:
1491 msec for raw array
4238 msec for qvector
1491 msec for std::vector
The times for a raw / C array and std::vector are practically identical, while QVector languishes.
If we were to test reading by accumulating the container values the situation is reversed, note that in this case we are using at(index) const for QVector:
2169 msec for raw array
2170 msec for qvector
2801 msec for std::vector
Without the penalty of COW QVector has performance identical to a raw C array. It is std::vector that loses here, although I am not sure exactly why, maybe someone else can elaborate.

Related

Accelerating a nested loop in C++

I have the following piece of C++ code. The scale of the problem is N and M. Running the code takes about two minutes on my machine. (after g++ -O3 compilation). Is there anyway to further accelerate it, on the same machine? Any kind of option, choosing a better data structure, library, GPU or parallelism, etc, is on the table.
void demo() {
int N = 1000000;
int M=3000;
vector<vector<int> > res(M);
for (int i =0; i <N;i++) {
for (int j=1; j < M; j++){
res[j].push_back(i);
}
}
}
int main() {
demo();
return 0;
}
An additional info: The second loop above for (int j=1; j < M; j++) is a simplified version of the real problem. In fact, j could be in a different range for each i (of the outer loop), but the number of iterations is about 3000.
With the exact code as shown when writing this answer, you could create the inner vector once, with the specific size, and call iota to initialize it. Then just pass this vector along to the outer vector constructor to use it for each element.
Then you don't need any explicit loops at all, and instead use the (highly optimized, hopefully) standard library to do all the work for you.
Perhaps something like this:
void demo()
{
static int const N = 1000000;
static int const M = 3000;
std::vector<int> data(N);
std::iota(begin(data), end(data), 0);
std::vector<std::vector<int>> res(M, data);
}
Alternatively you could try to initialize just one vector with that elements, and then create the other vectors just by copying that part of the memory using std::memcpy or std::copy.
Another optimization would be to allocate the memory in advance (e.g. array.reserve(3000)).
Also if you're sure that all the members of the vector are similar vectors, you could do a hack by just creating a single vector with 3000 elements, and in the other res just put the same reference of that 3000-element vector million times.
On my machine which has enough memory to avoid swapping your original code took 86 seconds.
Adding reserve:
for (auto& v : res)
{
v.reserve(N);
}
made basically no difference (85 seconds but I only ran each version once).
Swapping the loop order:
for (int j = 1; j < M; j++) {
for (int i = 0; i < N; i++) {
res[j].push_back(i);
}
}
reduced the time to 10 seconds, this is likely due to a combination of allowing the compiler to use SIMD optimisations and improving cache coherency by accessing memory in sequential order.
Creating one vector and copying it into the others:
for (int i = 0; i < N; i++) {
res[1].push_back(i);
}
for (int j = 2; j < M; j++) {
res[j] = res[1];
}
reduced the time to 4 seconds.
Using a single vector:
void demo() {
size_t N = 1000000;
size_t M = 3000;
vector<int> res(M*N);
size_t offset = N;
for (size_t i = 0; i < N; i++) {
res[offset++] = i;
}
for (size_t j = 2; j < M; j++) {
std::copy(res.begin() + N, res.begin() + N * 2, res.begin() + offset);
offset += N;
}
}
also took 4 seconds, there probably isn't much improvement because you have 3,000 4 MB vectors, there would likely be more difference if N was smaller or M was larger.

Copy local array is faster than array from arguments in c++?

While optimizing some code I discovered some things that I didn't expected.
I wrote a simple code to illustrate what I found below:
#include <string.h>
#include <chrono>
#include <iostream>
using namespace std;
int globalArr[1024][1024];
void initArr(int arr[1024][1024])
{
memset(arr, 0, 1024 * 1024 * sizeof(int));
}
void run()
{
int arr[1024][1024];
initArr(arr);
for(int i = 0; i < 1024; ++i)
{
for(int j = 0; j < 1024; ++j)
{
globalArr[i][j] = arr[i][j];
}
}
}
void run2(int arr[1024][1024])
{
initArr(arr);
for(int i = 0; i < 1024; ++i)
{
for(int j = 0; j < 1024; ++j)
{
globalArr[i][j] = arr[i][j];
}
}
}
int main()
{
{
auto start = chrono::high_resolution_clock::now();
for(int i = 0; i < 256; ++i)
{
run();
}
auto duration = chrono::high_resolution_clock::now() - start;
cout << "(run) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
}
{
auto start = chrono::high_resolution_clock::now();
for(int i = 0; i < 256; ++i)
{
int arr[1024][1024];
run2(arr);
}
auto duration = chrono::high_resolution_clock::now() - start;
cout << "(run2) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
}
return 0;
}
I build the code with g++ version 6.4.0 20180424 with -O3 flag.
Below is the result running on ryzen 1700.
(run) Total time: 43493 microseconds
(run2) Total time: 134740 microseconds
I tried to see the assembly with godbolt.org (Code separated in 2 urls)
https://godbolt.org/g/aKSHH6
https://godbolt.org/g/zfK14x
But I still don't understand what actually made the difference.
So my questions are:
1. What's causing the performance difference?
2. Is it possible passing array in argument with the same performance as local array?
Edit:
Just some extra info, below is the result build using O2
(run) Total time: 94461 microseconds
(run2) Total time: 172352 microseconds
Edit again:
From xaxxon's comment, I try remove the initArr call in both functions. And the result actually run2 is better than run
(run) Total time: 45151 microseconds
(run2) Total time: 35845 microseconds
But I still don't understand the reason.
What's causing the performance difference?
The compiler has to generate code for run2 that will continue to work correctly if you call
run2(globalArr);
or (worse), pass in some overlapping but non-identical address.
If you allow your C++ compiler to inline the call, and it chooses to do so, it'll be able to generate inlined code that knows whether the parameter really aliases your global. The out-of-line codegen still has to be conservative though.
Is it possible passing array in argument with the same performance as local array?
You can certainly fix the aliasing problem in C, using the restrict keyword, like
void run2(int (* restrict globalArr2)[256])
{
int (* restrict g)[256] = globalArr1;
for(int i = 0; i < 32; ++i)
{
for(int j = 0; j < 256; ++j)
{
g[i][j] = globalArr2[i][j];
}
}
}
(or probably in C++ using the non-standard extension __restrict).
This should allow the optimizer as much freedom as it had in your original run - unless it's smart enough to elide the local entirely and simply set the global to zero.

Fastest erase element from vector or better use of memory (sorting radix)

I have a trouble, I create a implementation from radix sort algorithm, but I think that I can use less memory, And I can! but... I do this erasing the element of the vector after used. The problem: 3 minutes of execution vs 17 secs. How erase faster the element? or... How use the memory better.
sort.hpp
#include <iostream>
#include <vector>
#include <algorithm>
unsigned digits_counter(long long unsigned x);
void radix( std::vector<unsigned> &vec )
{
unsigned size = vec.size();
if(size == 0);
else {
unsigned int max = *max_element(vec.begin(), vec.end());
unsigned int digits = digits_counter( max );
// FOR EVERY 10 POWER...
for (unsigned i = 0; i < digits; i++) {
std::vector < std::vector <unsigned> > base(10, std::vector <unsigned> ());
#ifdef ERASE
// GET EVERY NUMBER IN THE VECTOR AND
for (unsigned j = 0; j < size; j++) {
unsigned int digit = vec[0];
// GET THE DIGIT FROM POSITION "i" OF THE NUMBER vec[j]
for (unsigned k = 0; k < i; k++)
digit /= 10;
digit %= 10;
// AND PUSH NUMBER IN HIS BASE BUCKET
base[ digit ].push_back( vec[0] );
vec.erase(vec.begin());
}
#else
// GET EVERY NUMBER IN THE VECTOR AND
for (unsigned j = 0; j < size; j++) {
unsigned int digit = vec[j];
// GET THE DIGIT FROM POSITION "i" OF THE NUMBER vec[j]
for (unsigned k = 0; k < i; k++)
digit /= 10;
digit %= 10;
// AND PUSH NUMBER IN HIS BASE BUCKET
base[ digit ].push_back( vec[j] );
}
vec.erase(vec.begin(), vec.end());
#endif
for (unsigned j = 0; j < 10; j++)
for (unsigned k = 0; k < base[j].size(); k++)
vec.push_back( base[j][k] );
}
}
}
void fancy_sort( std::vector <unsigned> &v ) {
if( v.size() <= 1 )
return;
if( v.size() == 2 ) {
if (v.front() >= v.back())
std::swap(v.front(), v.back());
return;
}
radix(v);
}
sort.cpp
#include <vector>
#include "sort.hpp"
using namespace std;
int main(void)
{
vector <unsigned> vec;
vec.resize(rand()%10000);
for (unsigned j = 0; j < 10000; j++) {
for (unsigned int i = 0; i < vec.size(); i++)
vec[i] = rand() % 100;
fancy_sort(vec);
}
return 0;
}
I'm just learning... it's my 2nd chapter of Deitel C++. So... If someone have a more complex solution... I can learn how use it, the difficulty doesn't matter.
Results
Without erase:
g++ -O3 -Wall sort.cpp && time ./a.out
./a.out 2.93s user 0.00s system 98% cpu 2.964 total
With erase:
g++ -D ERASE -O3 -Wall sort.cpp && time ./a.out
./a.out 134.64s user 0.06s system 99% cpu 2:15.20 total
std::vector is not made to have parts removed from it. That's a fact you have to live with. Your trade between speed and memory usage is a classical problem. With vectors, any removal (except from its end) is expensive and is a waste. This is because every time you remove elements, the program either has to reallocate the array internally, or has to shift all elements to fill the voids. That's an ultimate limitation that you can never overcome if you keep using vectors.
vectors for your problem: fast but lots of memory usage.
The other (probably) optimal extreme (memory-wise) is std::list, which has absolutely no problem with removing anything anywhere. On the other hand, accessing elements is only possible by iterating over the whole list to the element, because it's basically a doubly-linked list, and you can't access an element by its number.
lists for your problem: best memory usage, but slow because accessing list elements are slow.
Finally, the middle grounds is std::deque. They offer a middle grounds between vectors and lists. They're not contiguous in memory, but items can be found by their numbers. Removing elements from the middle doesn't necessarily cause the same destruction in vectors. Read this to learn more about them.
deques for your problem: middle grounds, depending on the problem, probably fast for both memory and access.
If memory is your most important concern, then definitely go with lists. That's the fastest for that. If you want the most general solution, go with deque. Also don't neglect the possibility of copying the whole array to another container, sort it, and then copy it back. Depending on the size of the array, that might be helpful.

Intrinsic array access is much faster than std::vector access -- Black Magic?

I have set up a test program to compare array access performance to that of std::vector. I have found several similar questions but none seem to address my specific concern. I was scratching my head for some time over why array access seemed to be 6 times faster than vector access, when I have read in the past that they should be equivalent. As it turns out, this seems to be a function of the Intel compiler (v12) and optimization (occurs with anything above -O1), since I see better performance with std::vector when using gcc v4.1.2, and array has only a 2x advantage with gcc v4.4.4. I am running the tests on a RHEL 5.8 machine with Xeon X5355 cores. As an aside, I have found iterators to be faster than element access.
I am compiling with the following commands:
icpc -fast test.cc
g++44 -O3 test.cc
Can anyone explain the dramatic improvement in speed?
#include <vector>
#include <iostream>
using namespace std;
int main() {
int sz = 100;
clock_t start,stop;
int ncycle=1000;
float temp = 1.1;
// Set up and initialize vector
vector< vector< vector<float> > > A(sz, vector< vector<float> >(sz, vector<float>(sz, 1.0)));
// Set up and initialize array
float*** a = new float**[sz];
for( int i=0; i<sz; ++i) {
a[i] = new float*[sz];
for( int j=0; j<sz; ++j) {
a[i][j] = new float[sz]();
for( int k=0; k<sz; ++k)
a[i][j][k] = 1.0;
}
}
// Time the array
start = clock();
for( int n=0; n<ncycle; ++n )
for( int i=0; i<sz; ++i )
for( int j=0; j<sz; ++j )
for( int k=0; k<sz; ++k )
a[i][j][k] *= temp;
stop = clock();
std::cout << "STD ARRAY: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds" << std::endl;
// Time the vector
start = clock();
/*
*/
for( int n=0; n < ncycle; ++n )
for (vector<vector<vector<float> > >::iterator it1 = A.begin(); it1 != A.end(); ++it1)
for (vector<vector<float> >::iterator it2 = it1->begin(); it2 != it1->end(); ++it2)
for (vector<float>::iterator it3 =it2->begin(); it3 != it2->end(); ++it3)
*it3 *= temp;
/*
for( int n=0; n < ncycle; ++n )
for( int i=0; i < sz; ++i )
for( int j=0; j < sz; ++j )
for( int k=0; k < sz; ++k )
A[i][j][k] *= temp;
*/
stop = clock();
std::cout << "VECTOR: " << double((stop - start)) / CLOCKS_PER_SEC << " seconds" << std::endl;
for( int i=0; i<100; ++i) {
for( int j=0; j<100; ++j)
delete[] a[i][j];
}
for( int i=0; i<100; ++i) {
delete[] a[i];
}
delete[] a;
return 0;
}
SOLVED
After noting Bo's indication that the compiler "knows everything" about the loop and can therefore optimize it more than the vector case, I replaced the multiplications by "temp" with multiplications by a call to "rand()". This leveled the playing field and in fact seems to give std::vector a slight lead. Timing of various scenarios are as follows:
ARRAY (flat): 111.15 seconds
ARRAY (flat): 0.011115 seconds per cycle
ARRAY (3d): 111.73 seconds
ARRAY (3d): 0.011173 seconds per cycle
VECTOR (flat): 110.51 seconds
VECTOR (flat): 0.011051 seconds per cycle
VECTOR (3d): 118.05 seconds
VECTOR (3d): 0.011805 seconds per cycle
VECTOR (flat iterator): 108.55 seconds
VECTOR (flat iterator): 0.010855 seconds per cycle
VECTOR (3d iterator): 111.93 seconds
VECTOR (3d iterator): 0.011193 seconds per cycle
The takeaway seems to be that vectors are just as fast as arrays, and slightly faster when flattened (contiguous memory) and used with iterators. My experiment only averaged over 10,000 iterations, so it could be argued that these are all roughly equivalent and the choice of which to use should be determined by whichever is easiest to use; in my case, that would be the "3d iterator" case.
There is no black magic here, it is just too easy for the compiler to see that here
for( int n=0; n<ncycle; ++n )
for( int i=0; i<sz; ++i )
for( int j=0; j<sz; ++j )
for( int k=0; k<sz; ++k )
a[i][j][k] *= temp;
everything is known at compile time. It can easily unroll the loop to speed it up.
The array is just a pointer to an area of memory. You cannot get more direct than that.
A vector needs to call inline functions to do the same thing. Which may or not be really inlined. as inline and __forceinline are just guidances for the compiler to inline your code. The compiler in turn is free to do whatever it wants.
Also, iterator is not necessarily a pointer. The compiler, is calling an inline function which may or not be inlined. (again inline is a guidance to the compiler, not a rule).
When in doubt. Compile to assembly and see the resulting code. If the compiler is doing its job, there should be no noticeable difference. If however, if the compiler is not inlining what is supposed to inline, you will get a huge performance penalty compiler to using vector instead of an array.
All elements in an nested array are in contiguous memory locations, so when the compiler encounters an expression of the form a[x][y][z] it generates code which calculates the real index, which involves just integer multiplications and additions.
An nested std::vector, on the other hand, is really nested. To expression v[x][y][z], involves two more levels of indirection, because at v[x] there is really a std::vector object which contains a pointer to an array of vectors which in turn contain the real data.

Is std::vector so much slower than plain arrays?

I've always thought it's the general wisdom that std::vector is "implemented as an array," blah blah blah. Today I went down and tested it, and it seems to be not so:
Here's some test results:
UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds
That's about 3 - 4 times slower! Doesn't really justify for the "vector may be slower for a few nanosecs" comments.
And the code I used:
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>
class TestTimer
{
public:
TestTimer(const std::string & name) : name(name),
start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
{
}
~TestTimer()
{
using namespace std;
using namespace boost;
posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
posix_time::time_duration d = now - start;
cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
" seconds" << endl;
}
private:
std::string name;
boost::posix_time::ptime start;
};
struct Pixel
{
Pixel()
{
}
Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
{
}
unsigned char r, g, b;
};
void UseVector()
{
TestTimer t("UseVector");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.resize(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}
void UseVectorPushBack()
{
TestTimer t("UseVectorPushBack");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.reserve(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
pixels.push_back(Pixel(255, 0, 0));
}
}
void UseArray()
{
TestTimer t("UseArray");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);
for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
free(pixels);
}
}
int main()
{
TestTimer t1("The whole thing");
UseArray();
UseVector();
UseVectorPushBack();
return 0;
}
Am I doing it wrong or something? Or have I just busted this performance myth?
I'm using Release mode in Visual Studio 2005.
In Visual C++, #define _SECURE_SCL 0 reduces UseVector by half (bringing it down to 4 seconds). This is really huge, IMO.
Using the following:
g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseArray completed in 2.196 seconds
UseVector completed in 4.412 seconds
UseVectorPushBack completed in 8.017 seconds
The whole thing completed in 14.626 seconds
So array is twice as quick as vector.
But after looking at the code in more detail this is expected; as you run across the vector twice and the array only once. Note: when you resize() the vector you are not only allocating the memory but also running through the vector and calling the constructor on each member.
Re-Arranging the code slightly so that the vector only initializes each object once:
std::vector<Pixel> pixels(dimensions * dimensions, Pixel(255,0,0));
Now doing the same timing again:
g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseVector completed in 2.216 seconds
The vector now performance only slightly worse than the array. IMO this difference is insignificant and could be caused by a whole bunch of things not associated with the test.
I would also take into account that you are not correctly initializing/Destroying the Pixel object in the UseArrray() method as neither constructor/destructor is not called (this may not be an issue for this simple class but anything slightly more complex (ie with pointers or members with pointers) will cause problems.
Great question. I came in here expecting to find some simple fix that would speed the vector tests right up. That didn't work out quite like I expected!
Optimization helps, but it's not enough. With optimization on I'm still seeing a 2X performance difference between UseArray and UseVector. Interestingly, UseVector was significantly slower than UseVectorPushBack without optimization.
# g++ -Wall -Wextra -pedantic -o vector vector.cpp
# ./vector
UseArray completed in 20.68 seconds
UseVector completed in 120.509 seconds
UseVectorPushBack completed in 37.654 seconds
The whole thing completed in 178.845 seconds
# g++ -Wall -Wextra -pedantic -O3 -o vector vector.cpp
# ./vector
UseArray completed in 3.09 seconds
UseVector completed in 6.09 seconds
UseVectorPushBack completed in 9.847 seconds
The whole thing completed in 19.028 seconds
Idea #1 - Use new[] instead of malloc
I tried changing malloc() to new[] in UseArray so the objects would get constructed. And changing from individual field assignment to assigning a Pixel instance. Oh, and renaming the inner loop variable to j.
void UseArray()
{
TestTimer t("UseArray");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
// Same speed as malloc().
Pixel * pixels = new Pixel[dimension * dimension];
for(int j = 0 ; j < dimension * dimension; ++j)
pixels[j] = Pixel(255, 0, 0);
delete[] pixels;
}
}
Surprisingly (to me), none of those changes made any difference whatsoever. Not even the change to new[] which will default construct all of the Pixels. It seems that gcc can optimize out the default constructor calls when using new[], but not when using vector.
Idea #2 - Remove repeated operator[] calls
I also attempted to get rid of the triple operator[] lookup and cache the reference to pixels[j]. That actually slowed UseVector down! Oops.
for(int j = 0; j < dimension * dimension; ++j)
{
// Slower than accessing pixels[j] three times.
Pixel &pixel = pixels[j];
pixel.r = 255;
pixel.g = 0;
pixel.b = 0;
}
# ./vector
UseArray completed in 3.226 seconds
UseVector completed in 7.54 seconds
UseVectorPushBack completed in 9.859 seconds
The whole thing completed in 20.626 seconds
Idea #3 - Remove constructors
What about removing the constructors entirely? Then perhaps gcc can optimize out the construction of all of the objects when the vectors are created. What happens if we change Pixel to:
struct Pixel
{
unsigned char r, g, b;
};
Result: about 10% faster. Still slower than an array. Hm.
# ./vector
UseArray completed in 3.239 seconds
UseVector completed in 5.567 seconds
Idea #4 - Use iterator instead of loop index
How about using a vector<Pixel>::iterator instead of a loop index?
for (std::vector<Pixel>::iterator j = pixels.begin(); j != pixels.end(); ++j)
{
j->r = 255;
j->g = 0;
j->b = 0;
}
Result:
# ./vector
UseArray completed in 3.264 seconds
UseVector completed in 5.443 seconds
Nope, no different. At least it's not slower. I thought this would have performance similar to #2 where I used a Pixel& reference.
Conclusion
Even if some smart cookie figures out how to make the vector loop as fast as the array one, this does not speak well of the default behavior of std::vector. So much for the compiler being smart enough to optimize out all the C++ness and make STL containers as fast as raw arrays.
The bottom line is that the compiler is unable to optimize away the no-op default constructor calls when using std::vector. If you use plain new[] it optimizes them away just fine. But not with std::vector. Even if you can rewrite your code to eliminate the constructor calls that flies in face of the mantra around here: "The compiler is smarter than you. The STL is just as fast as plain C. Don't worry about it."
This is an old but popular question.
At this point, many programmers will be working in C++11. And in C++11 the OP's code as written runs equally fast for UseArray or UseVector.
UseVector completed in 3.74482 seconds
UseArray completed in 3.70414 seconds
The fundamental problem was that while your Pixel structure was uninitialized, std::vector<T>::resize( size_t, T const&=T() ) takes a default constructed Pixel and copies it. The compiler did not notice it was being asked to copy uninitialized data, so it actually performed the copy.
In C++11, std::vector<T>::resize has two overloads. The first is std::vector<T>::resize(size_t), the other is std::vector<T>::resize(size_t, T const&). This means when you invoke resize without a second argument, it simply default constructs, and the compiler is smart enough to realize that default construction does nothing, so it skips the pass over the buffer.
(The two overloads where added to handle movable, constructable and non-copyable types -- the performance improvement when working on uninitialized data is a bonus).
The push_back solution also does fencepost checking, which slows it down, so it remains slower than the malloc version.
live example (I also replaced the timer with chrono::high_resolution_clock).
Note that if you have a structure that usually requires initialization, but you want to handle it after growing your buffer, you can do this with a custom std::vector allocator. If you want to then move it into a more normal std::vector, I believe careful use of allocator_traits and overriding of == might pull that off, but am unsure.
To be fair, you cannot compare a C++ implementation to a C implementation, as I would call your malloc version. malloc does not create objects - it only allocates raw memory. That you then treat that memory as objects without calling the constructor is poor C++ (possibly invalid - I'll leave that to the language lawyers).
That said, simply changing the malloc to new Pixel[dimensions*dimensions] and free to delete [] pixels does not make much difference with the simple implementation of Pixel that you have. Here's the results on my box (E6600, 64-bit):
UseArray completed in 0.269 seconds
UseVector completed in 1.665 seconds
UseVectorPushBack completed in 7.309 seconds
The whole thing completed in 9.244 seconds
But with a slight change, the tables turn:
Pixel.h
struct Pixel
{
Pixel();
Pixel(unsigned char r, unsigned char g, unsigned char b);
unsigned char r, g, b;
};
Pixel.cc
#include "Pixel.h"
Pixel::Pixel() {}
Pixel::Pixel(unsigned char r, unsigned char g, unsigned char b)
: r(r), g(g), b(b) {}
main.cc
#include "Pixel.h"
[rest of test harness without class Pixel]
[UseArray now uses new/delete not malloc/free]
Compiled this way:
$ g++ -O3 -c -o Pixel.o Pixel.cc
$ g++ -O3 -c -o main.o main.cc
$ g++ -o main main.o Pixel.o
we get very different results:
UseArray completed in 2.78 seconds
UseVector completed in 1.651 seconds
UseVectorPushBack completed in 7.826 seconds
The whole thing completed in 12.258 seconds
With a non-inlined constructor for Pixel, std::vector now beats a raw array.
It would appear that the complexity of allocation through std::vector and std:allocator is too much to be optimised as effectively as a simple new Pixel[n]. However, we can see that the problem is simply with the allocation not the vector access by tweaking a couple of the test functions to create the vector/array once by moving it outside the loop:
void UseVector()
{
TestTimer t("UseVector");
int dimension = 999;
std::vector<Pixel> pixels;
pixels.resize(dimension * dimension);
for(int i = 0; i < 1000; ++i)
{
for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}
and
void UseArray()
{
TestTimer t("UseArray");
int dimension = 999;
Pixel * pixels = new Pixel[dimension * dimension];
for(int i = 0; i < 1000; ++i)
{
for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
delete [] pixels;
}
We get these results now:
UseArray completed in 0.254 seconds
UseVector completed in 0.249 seconds
UseVectorPushBack completed in 7.298 seconds
The whole thing completed in 7.802 seconds
What we can learn from this is that std::vector is comparable to a raw array for access, but if you need to create and delete the vector/array many times, creating a complex object will be more time consuming that creating a simple array when the element's constructor is not inlined. I don't think that this is very surprising.
It was hardly a fair comparison when I first looked at your code; I definitely thought you weren't comparing apples with apples. So I thought, let's get constructors and destructors being called on all tests; and then compare.
const size_t dimension = 1000;
void UseArray() {
TestTimer t("UseArray");
for(size_t j = 0; j < dimension; ++j) {
Pixel* pixels = new Pixel[dimension * dimension];
for(size_t i = 0 ; i < dimension * dimension; ++i) {
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = (unsigned char) (i % 255);
}
delete[] pixels;
}
}
void UseVector() {
TestTimer t("UseVector");
for(size_t j = 0; j < dimension; ++j) {
std::vector<Pixel> pixels(dimension * dimension);
for(size_t i = 0; i < dimension * dimension; ++i) {
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = (unsigned char) (i % 255);
}
}
}
int main() {
TestTimer t1("The whole thing");
UseArray();
UseVector();
return 0;
}
My thoughts were, that with this setup, they should be exactly the same. It turns out, I was wrong.
UseArray completed in 3.06 seconds
UseVector completed in 4.087 seconds
The whole thing completed in 10.14 seconds
So why did this 30% performance loss even occur? The STL has everything in headers, so it should have been possible for the compiler to understand everything that was required.
My thoughts were that it is in how the loop initialises all values to the default constructor. So I performed a test:
class Tester {
public:
static int count;
static int count2;
Tester() { count++; }
Tester(const Tester&) { count2++; }
};
int Tester::count = 0;
int Tester::count2 = 0;
int main() {
std::vector<Tester> myvec(300);
printf("Default Constructed: %i\nCopy Constructed: %i\n", Tester::count, Tester::count2);
return 0;
}
The results were as I suspected:
Default Constructed: 1
Copy Constructed: 300
This is clearly the source of the slowdown, the fact that the vector uses the copy constructor to initialise the elements from a default constructed object.
This means, that the following pseudo-operation order is happening during construction of the vector:
Pixel pixel;
for (auto i = 0; i < N; ++i) vector[i] = pixel;
Which, due to the implicit copy constructor made by the compiler, is expanded to the following:
Pixel pixel;
for (auto i = 0; i < N; ++i) {
vector[i].r = pixel.r;
vector[i].g = pixel.g;
vector[i].b = pixel.b;
}
So the default Pixel remains un-initialised, while the rest are initialised with the default Pixel's un-initialised values.
Compared to the alternative situation with New[]/Delete[]:
int main() {
Tester* myvec = new Tester[300];
printf("Default Constructed: %i\nCopy Constructed:%i\n", Tester::count, Tester::count2);
delete[] myvec;
return 0;
}
Default Constructed: 300
Copy Constructed: 0
They are all left to their un-initialised values, and without the double iteration over the sequence.
Armed with this information, how can we test it? Let's try over-writing the implicit copy constructor.
Pixel(const Pixel&) {}
And the results?
UseArray completed in 2.617 seconds
UseVector completed in 2.682 seconds
The whole thing completed in 5.301 seconds
So in summary, if you're making hundreds of vectors very often: re-think your algorithm.
In any case, the STL implementation isn't slower for some unknown reason, it just does exactly what you ask; hoping you know better.
Try with this:
void UseVectorCtor()
{
TestTimer t("UseConstructor");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
}
}
I get almost exactly the same performance as with array.
The thing about vector is that it's a much more general tool than an array. And that means you have to consider how you use it. It can be used in a lot of different ways, providing functionality that an array doesn't even have. And if you use it "wrong" for your purpose, you incur a lot of overhead, but if you use it correctly, it is usually basically a zero-overhead data structure. In this case, the problem is that you separately initialized the vector (causing all elements to have their default ctor called), and then overwriting each element individually with the correct value. That is much harder for the compiler to optimize away than when you do the same thing with an array. Which is why the vector provides a constructor which lets you do exactly that: initialize N elements with value X.
And when you use that, the vector is just as fast as an array.
So no, you haven't busted the performance myth. But you have shown that it's only true if you use the vector optimally, which is a pretty good point too. :)
On the bright side, it's really the simplest usage that turns out to be fastest. If you contrast my code snippet (a single line) with John Kugelman's answer, containing heaps and heaps of tweaks and optimizations, which still don't quite eliminate the performance difference, it's pretty clear that vector is pretty cleverly designed after all. You don't have to jump through hoops to get speed equal to an array. On the contrary, you have to use the simplest possible solution.
Try disabling checked iterators and building in release mode. You shouldn't see much of a performance difference.
GNU's STL (and others), given vector<T>(n), default constructs a prototypal object T() - the compiler will optimise away the empty constructor - but then a copy of whatever garbage happened to be in the memory addresses now reserved for the object is taken by the STL's __uninitialized_fill_n_aux, which loops populating copies of that object as the default values in the vector. So, "my" STL is not looping constructing, but constructing then loop/copying. It's counter intuitive, but I should have remembered as I commented on a recent stackoverflow question about this very point: the construct/copy can be more efficient for reference counted objects etc..
So:
vector<T> x(n);
or
vector<T> x;
x.resize(n);
is - on many STL implementations - something like:
T temp;
for (int i = 0; i < n; ++i)
x[i] = temp;
The issue being that the current generation of compiler optimisers don't seem to work from the insight that temp is uninitialised garbage, and fail to optimise out the loop and default copy constructor invocations. You could credibly argue that compilers absolutely shouldn't optimise this away, as a programmer writing the above has a reasonable expectation that all the objects will be identical after the loop, even if garbage (usual caveats about 'identical'/operator== vs memcmp/operator= etc apply). The compiler can't be expected to have any extra insight into the larger context of std::vector<> or the later usage of the data that would suggest this optimisation safe.
This can be contrasted with the more obvious, direct implementation:
for (int i = 0; i < n; ++i)
x[i] = T();
Which we can expect a compiler to optimise out.
To be a bit more explicit about the justification for this aspect of vector's behaviour, consider:
std::vector<big_reference_counted_object> x(10000);
Clearly it's a major difference if we make 10000 independent objects versus 10000 referencing the same data. There's a reasonable argument that the advantage of protecting casual C++ users from accidentally doing something so expensive outweights the very small real-world cost of hard-to-optimise copy construction.
ORIGINAL ANSWER (for reference / making sense of the comments):
No chance. vector is as fast as an array, at least if you reserve space sensibly. ...
Martin York's answer bothers me because it seems like an attempt to brush the initialisation problem under the carpet. But he is right to identify redundant default construction as the source of performance problems.
[EDIT: Martin's answer no longer suggests changing the default constructor.]
For the immediate problem at hand, you could certainly call the 2-parameter version of the vector<Pixel> ctor instead:
std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
That works if you want to initialise with a constant value, which is a common case. But the more general problem is: How can you efficiently initialise with something more complicated than a constant value?
For this you can use a back_insert_iterator, which is an iterator adaptor. Here's an example with a vector of ints, although the general idea works just as well for Pixels:
#include <iterator>
// Simple functor return a list of squares: 1, 4, 9, 16...
struct squares {
squares() { i = 0; }
int operator()() const { ++i; return i * i; }
private:
int i;
};
...
std::vector<int> v;
v.reserve(someSize); // To make insertions efficient
std::generate_n(std::back_inserter(v), someSize, squares());
Alternatively you could use copy() or transform() instead of generate_n().
The downside is that the logic to construct the initial values needs to be moved into a separate class, which is less convenient than having it in-place (although lambdas in C++1x make this much nicer). Also I expect this will still not be as fast as a malloc()-based non-STL version, but I expect it will be close, since it only does one construction for each element.
The vector ones are additionally calling Pixel constructors.
Each is causing almost a million ctor runs that you're timing.
edit: then there's the outer 1...1000 loop, so make that a billion ctor calls!
edit 2: it'd be interesting to see the disassembly for the UseArray case. An optimizer could optimize the whole thing away, since it has no effect other than burning CPU.
Here's how the push_back method in vector works:
The vector allocates X amount of space when it is initialized.
As stated below it checks if there is room in the current underlying array for the item.
It makes a copy of the item in the push_back call.
After calling push_back X items:
The vector reallocates kX amount of space into a 2nd array.
It Copies the entries of the first array onto the second.
Discards the first array.
Now uses the second array as storage until it reaches kX entries.
Repeat. If you're not reserving space its definitely going to be slower. More than that, if it's expensive to copy the item then 'push_back' like that is going to eat you alive.
As to the vector versus array thing, I'm going to have to agree with the other people. Run in release, turn optimizations on, and put in a few more flags so that the friendly people at Microsoft don't ##%$^ it up for ya.
One more thing, if you don't need to resize, use Boost.Array.
Some profiler data (pixel is aligned to 32 bits):
g++ -msse3 -O3 -ftree-vectorize -g test.cpp -DNDEBUG && ./a.out
UseVector completed in 3.123 seconds
UseArray completed in 1.847 seconds
UseVectorPushBack completed in 9.186 seconds
The whole thing completed in 14.159 seconds
Blah
andrey#nv:~$ opannotate --source libcchem/src/a.out | grep "Total samples for file" -A3
Overflow stats not available
* Total samples for file : "/usr/include/c++/4.4/ext/new_allocator.h"
*
* 141008 52.5367
*/
--
* Total samples for file : "/home/andrey/libcchem/src/test.cpp"
*
* 61556 22.9345
*/
--
* Total samples for file : "/usr/include/c++/4.4/bits/stl_vector.h"
*
* 41956 15.6320
*/
--
* Total samples for file : "/usr/include/c++/4.4/bits/stl_uninitialized.h"
*
* 20956 7.8078
*/
--
* Total samples for file : "/usr/include/c++/4.4/bits/stl_construct.h"
*
* 2923 1.0891
*/
In allocator:
: // _GLIBCXX_RESOLVE_LIB_DEFECTS
: // 402. wrong new expression in [some_] allocator::construct
: void
: construct(pointer __p, const _Tp& __val)
141008 52.5367 : { ::new((void *)__p) _Tp(__val); }
vector:
:void UseVector()
:{ /* UseVector() total: 60121 22.3999 */
...
:
:
10790 4.0201 : for (int i = 0; i < dimension * dimension; ++i) {
:
495 0.1844 : pixels[i].r = 255;
:
12618 4.7012 : pixels[i].g = 0;
:
2253 0.8394 : pixels[i].b = 0;
:
: }
array
:void UseArray()
:{ /* UseArray() total: 35191 13.1114 */
:
...
:
136 0.0507 : for (int i = 0; i < dimension * dimension; ++i) {
:
9897 3.6874 : pixels[i].r = 255;
:
3511 1.3081 : pixels[i].g = 0;
:
21647 8.0652 : pixels[i].b = 0;
Most of the overhead is in the copy constructor. For example,
std::vector < Pixel > pixels;//(dimension * dimension, Pixel());
pixels.reserve(dimension * dimension);
for (int i = 0; i < dimension * dimension; ++i) {
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
It has the same performance as an array.
My laptop is Lenova G770 (4 GB RAM).
The OS is Windows 7 64-bit (the one with laptop)
Compiler is MinGW 4.6.1.
The IDE is Code::Blocks.
I test the source codes of the first post.
The results
O2 optimization
UseArray completed in 2.841 seconds
UseVector completed in 2.548 seconds
UseVectorPushBack completed in 11.95 seconds
The whole thing completed in 17.342 seconds
system pause
O3 optimization
UseArray completed in 1.452 seconds
UseVector completed in 2.514 seconds
UseVectorPushBack completed in 12.967 seconds
The whole thing completed in 16.937 seconds
It looks like the performance of vector is worse under O3 optimization.
If you change the loop to
pixels[i].r = i;
pixels[i].g = i;
pixels[i].b = i;
The speed of array and vector under O2 and O3 are almost the same.
A better benchmark (I think...), compiler due to optimizations can change code, becouse results of allocated vectors/arrays are not used anywhere.
Results:
$ g++ test.cpp -o test -O3 -march=native
$ ./test
UseArray inner completed in 0.652 seconds
UseArray completed in 0.773 seconds
UseVector inner completed in 0.638 seconds
UseVector completed in 0.757 seconds
UseVectorPushBack inner completed in 6.732 seconds
UseVectorPush completed in 6.856 seconds
The whole thing completed in 8.387 seconds
Compiler:
gcc version 6.2.0 20161019 (Debian 6.2.0-9)
CPU:
model name : Intel(R) Core(TM) i7-3630QM CPU # 2.40GHz
And the code:
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>
class TestTimer
{
public:
TestTimer(const std::string & name) : name(name),
start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
{
}
~TestTimer()
{
using namespace std;
using namespace boost;
posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
posix_time::time_duration d = now - start;
cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
" seconds" << endl;
}
private:
std::string name;
boost::posix_time::ptime start;
};
struct Pixel
{
Pixel()
{
}
Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
{
}
unsigned char r, g, b;
};
void UseVector(std::vector<std::vector<Pixel> >& results)
{
TestTimer t("UseVector inner");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel>& pixels = results.at(i);
pixels.resize(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}
void UseVectorPushBack(std::vector<std::vector<Pixel> >& results)
{
TestTimer t("UseVectorPushBack inner");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel>& pixels = results.at(i);
pixels.reserve(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
pixels.push_back(Pixel(255, 0, 0));
}
}
void UseArray(Pixel** results)
{
TestTimer t("UseArray inner");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);
results[i] = pixels;
for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
// free(pixels);
}
}
void UseArray()
{
TestTimer t("UseArray");
Pixel** array = (Pixel**)malloc(sizeof(Pixel*)* 1000);
UseArray(array);
for(int i=0;i<1000;++i)
free(array[i]);
free(array);
}
void UseVector()
{
TestTimer t("UseVector");
{
std::vector<std::vector<Pixel> > vector(1000, std::vector<Pixel>());
UseVector(vector);
}
}
void UseVectorPushBack()
{
TestTimer t("UseVectorPush");
{
std::vector<std::vector<Pixel> > vector(1000, std::vector<Pixel>());
UseVectorPushBack(vector);
}
}
int main()
{
TestTimer t1("The whole thing");
UseArray();
UseVector();
UseVectorPushBack();
return 0;
}
I did some extensive tests that I wanted to for a while now. Might as well share this.
This is my dual boot machine i7-3770, 16GB Ram, x86_64, on Windows 8.1 and on Ubuntu 16.04. More information and conclusions, remarks below. Tested both MSVS 2017 and g++ (both on Windows and on Linux).
Test Program
#include <iostream>
#include <chrono>
//#include <algorithm>
#include <array>
#include <locale>
#include <vector>
#include <queue>
#include <deque>
// Note: total size of array must not exceed 0x7fffffff B = 2,147,483,647B
// which means that largest int array size is 536,870,911
// Also image size cannot be larger than 80,000,000B
constexpr int long g_size = 100000;
int g_A[g_size];
int main()
{
std::locale loc("");
std::cout.imbue(loc);
constexpr int long size = 100000; // largest array stack size
// stack allocated c array
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
int A[size];
for (int i = 0; i < size; i++)
A[i] = i;
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "c-style stack array duration=" << duration / 1000.0 << "ms\n";
std::cout << "c-style stack array size=" << sizeof(A) << "B\n\n";
// global stack c array
start = std::chrono::steady_clock::now();
for (int i = 0; i < g_size; i++)
g_A[i] = i;
duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "global c-style stack array duration=" << duration / 1000.0 << "ms\n";
std::cout << "global c-style stack array size=" << sizeof(g_A) << "B\n\n";
// raw c array heap array
start = std::chrono::steady_clock::now();
int* AA = new int[size]; // bad_alloc() if it goes higher than 1,000,000,000
for (int i = 0; i < size; i++)
AA[i] = i;
duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "c-style heap array duration=" << duration / 1000.0 << "ms\n";
std::cout << "c-style heap array size=" << sizeof(AA) << "B\n\n";
delete[] AA;
// std::array<>
start = std::chrono::steady_clock::now();
std::array<int, size> AAA;
for (int i = 0; i < size; i++)
AAA[i] = i;
//std::sort(AAA.begin(), AAA.end());
duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "std::array duration=" << duration / 1000.0 << "ms\n";
std::cout << "std::array size=" << sizeof(AAA) << "B\n\n";
// std::vector<>
start = std::chrono::steady_clock::now();
std::vector<int> v;
for (int i = 0; i < size; i++)
v.push_back(i);
//std::sort(v.begin(), v.end());
duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "std::vector duration=" << duration / 1000.0 << "ms\n";
std::cout << "std::vector size=" << v.size() * sizeof(v.back()) << "B\n\n";
// std::deque<>
start = std::chrono::steady_clock::now();
std::deque<int> dq;
for (int i = 0; i < size; i++)
dq.push_back(i);
//std::sort(dq.begin(), dq.end());
duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "std::deque duration=" << duration / 1000.0 << "ms\n";
std::cout << "std::deque size=" << dq.size() * sizeof(dq.back()) << "B\n\n";
// std::queue<>
start = std::chrono::steady_clock::now();
std::queue<int> q;
for (int i = 0; i < size; i++)
q.push(i);
duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::steady_clock::now() - start).count();
std::cout << "std::queue duration=" << duration / 1000.0 << "ms\n";
std::cout << "std::queue size=" << q.size() * sizeof(q.front()) << "B\n\n";
}
Results
//////////////////////////////////////////////////////////////////////////////////////////
// with MSVS 2017:
// >> cl /std:c++14 /Wall -O2 array_bench.cpp
//
// c-style stack array duration=0.15ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.130ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.90ms
// c-style heap array size=4B
//
// std::array duration=0.20ms
// std::array size=400,000B
//
// std::vector duration=0.544ms
// std::vector size=400,000B
//
// std::deque duration=1.375ms
// std::deque size=400,000B
//
// std::queue duration=1.491ms
// std::queue size=400,000B
//
//////////////////////////////////////////////////////////////////////////////////////////
//
// with g++ version:
// - (tdm64-1) 5.1.0 on Windows
// - (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 on Ubuntu 16.04
// >> g++ -std=c++14 -Wall -march=native -O2 array_bench.cpp -o array_bench
//
// c-style stack array duration=0ms
// c-style stack array size=400,000B
//
// global c-style stack array duration=0.124ms
// global c-style stack array size=400,000B
//
// c-style heap array duration=0.648ms
// c-style heap array size=8B
//
// std::array duration=1ms
// std::array size=400,000B
//
// std::vector duration=0.402ms
// std::vector size=400,000B
//
// std::deque duration=0.234ms
// std::deque size=400,000B
//
// std::queue duration=0.304ms
// std::queue size=400,000
//
//////////////////////////////////////////////////////////////////////////////////////////
Notes
Assembled by an average of 10 runs.
I initially performed tests with std::sort() too (you can see it commented out) but removed them later because there were no significant relative differences.
My Conclusions and Remarks
notice how global c-style array takes almost as much time as the heap c-style array
Out of all tests I noticed a remarkable stability in std::array's time variations between consecutive runs, while others especially std:: data structs varied wildly in comparison
O3 optimization didn't show any noteworthy time differences
Removing optimization on Windows cl (no -O2) and on g++ (Win/Linux no -O2, no -march=native) increases times SIGNIFICANTLY. Particularly for std::data structs. Overall higher times on MSVS than g++, but std::array and c-style arrays faster on Windows without optimization
g++ produces faster code than microsoft's compiler (apparently it runs faster even on Windows).
Verdict
Of course this is code for an optimized build. And since the question was about std::vector then yes it is !much! slower than plain arrays (optimized/unoptimized). But when you're doing a benchmark, you naturally want to produce optimized code.
The star of the show for me though has been std::array.
With the right options, vectors and arrays can generate identical asm. In these cases, they are of course the same speed, because you get the same executable file either way.
By the way the slow down your seeing in classes using vector also occurs with standard types like int. Heres a multithreaded code:
#include <iostream>
#include <cstdio>
#include <map>
#include <string>
#include <typeinfo>
#include <vector>
#include <pthread.h>
#include <sstream>
#include <fstream>
using namespace std;
//pthread_mutex_t map_mutex=PTHREAD_MUTEX_INITIALIZER;
long long num=500000000;
int procs=1;
struct iterate
{
int id;
int num;
void * member;
iterate(int a, int b, void *c) : id(a), num(b), member(c) {}
};
//fill out viterate and piterate
void * viterate(void * input)
{
printf("am in viterate\n");
iterate * info=static_cast<iterate *> (input);
// reproduce member type
vector<int> test= *static_cast<vector<int>*> (info->member);
for (int i=info->id; i<test.size(); i+=info->num)
{
//printf("am in viterate loop\n");
test[i];
}
pthread_exit(NULL);
}
void * piterate(void * input)
{
printf("am in piterate\n");
iterate * info=static_cast<iterate *> (input);;
int * test=static_cast<int *> (info->member);
for (int i=info->id; i<num; i+=info->num) {
//printf("am in piterate loop\n");
test[i];
}
pthread_exit(NULL);
}
int main()
{
cout<<"producing vector of size "<<num<<endl;
vector<int> vtest(num);
cout<<"produced a vector of size "<<vtest.size()<<endl;
pthread_t thread[procs];
iterate** it=new iterate*[procs];
int ans;
void *status;
cout<<"begining to thread through the vector\n";
for (int i=0; i<procs; i++) {
it[i]=new iterate(i, procs, (void *) &vtest);
// ans=pthread_create(&thread[i],NULL,viterate, (void *) it[i]);
}
for (int i=0; i<procs; i++) {
pthread_join(thread[i], &status);
}
cout<<"end of threading through the vector";
//reuse the iterate structures
cout<<"producing a pointer with size "<<num<<endl;
int * pint=new int[num];
cout<<"produced a pointer with size "<<num<<endl;
cout<<"begining to thread through the pointer\n";
for (int i=0; i<procs; i++) {
it[i]->member=&pint;
ans=pthread_create(&thread[i], NULL, piterate, (void*) it[i]);
}
for (int i=0; i<procs; i++) {
pthread_join(thread[i], &status);
}
cout<<"end of threading through the pointer\n";
//delete structure array for iterate
for (int i=0; i<procs; i++) {
delete it[i];
}
delete [] it;
//delete pointer
delete [] pint;
cout<<"end of the program"<<endl;
return 0;
}
The behavior from the code shows the instantiation of vector is the longest part of the code. Once you get through that bottle neck. The rest of the code runs extremely fast. This is true no matter how many threads you are running on.
By the way ignore the absolutely insane number of includes. I have been using this code to test things for a project so the number of includes keep growing.
I just want to mention that vector (and smart_ptr) is just a thin layer add on top of raw arrays (and raw pointers).
And actually the access time of an vector in continuous memory is faster than array.
The following code shows the result of initialize and access vector and array.
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>
#include <vector>
#define SIZE 20000
int main() {
srand (time(NULL));
vector<vector<int>> vector2d;
vector2d.reserve(SIZE);
int index(0);
boost::posix_time::ptime start_total = boost::posix_time::microsec_clock::local_time();
// timer start - build + access
for (int i = 0; i < SIZE; i++) {
vector2d.push_back(vector<int>(SIZE));
}
boost::posix_time::ptime start_access = boost::posix_time::microsec_clock::local_time();
// timer start - access
for (int i = 0; i < SIZE; i++) {
index = rand()%SIZE;
for (int j = 0; j < SIZE; j++) {
vector2d[index][index]++;
}
}
boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
boost::posix_time::time_duration msdiff = end - start_total;
cout << "Vector total time: " << msdiff.total_milliseconds() << "milliseconds.\n";
msdiff = end - start_acess;
cout << "Vector access time: " << msdiff.total_milliseconds() << "milliseconds.\n";
int index(0);
int** raw2d = nullptr;
raw2d = new int*[SIZE];
start_total = boost::posix_time::microsec_clock::local_time();
// timer start - build + access
for (int i = 0; i < SIZE; i++) {
raw2d[i] = new int[SIZE];
}
start_access = boost::posix_time::microsec_clock::local_time();
// timer start - access
for (int i = 0; i < SIZE; i++) {
index = rand()%SIZE;
for (int j = 0; j < SIZE; j++) {
raw2d[index][index]++;
}
}
end = boost::posix_time::microsec_clock::local_time();
msdiff = end - start_total;
cout << "Array total time: " << msdiff.total_milliseconds() << "milliseconds.\n";
msdiff = end - start_acess;
cout << "Array access time: " << msdiff.total_milliseconds() << "milliseconds.\n";
for (int i = 0; i < SIZE; i++) {
delete [] raw2d[i];
}
return 0;
}
The output is:
Vector total time: 925milliseconds.
Vector access time: 4milliseconds.
Array total time: 30milliseconds.
Array access time: 21milliseconds.
So the speed will be almost the same if you use it properly.
(as others mentioned using reserve() or resize()).
Well, because vector::resize() does much more processing than plain memory allocation (by malloc).
Try to put a breakpoint in your copy constructor (define it so that you can breakpoint!) and there goes the additional processing time.
I Have to say I'm not an expert in C++. But to add some experiments results:
compile:
gcc-6.2.0/bin/g++ -O3 -std=c++14 vector.cpp
machine:
Intel(R) Xeon(R) CPU E5-2690 v2 # 3.00GHz
OS:
2.6.32-642.13.1.el6.x86_64
Output:
UseArray completed in 0.167821 seconds
UseVector completed in 0.134402 seconds
UseConstructor completed in 0.134806 seconds
UseFillConstructor completed in 1.00279 seconds
UseVectorPushBack completed in 6.6887 seconds
The whole thing completed in 8.12888 seconds
Here the only thing I feel strange is that "UseFillConstructor" performance compared with "UseConstructor".
The code:
void UseConstructor()
{
TestTimer t("UseConstructor");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels(dimension*dimension);
for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}
void UseFillConstructor()
{
TestTimer t("UseFillConstructor");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels(dimension*dimension, Pixel(255,0,0));
}
}
So the additional "value" provided slows down performance quite a lot, which I think is due to multiple call to copy constructor. But...
Compile:
gcc-6.2.0/bin/g++ -std=c++14 -O vector.cpp
Output:
UseArray completed in 1.02464 seconds
UseVector completed in 1.31056 seconds
UseConstructor completed in 1.47413 seconds
UseFillConstructor completed in 1.01555 seconds
UseVectorPushBack completed in 6.9597 seconds
The whole thing completed in 11.7851 seconds
So in this case, gcc optimization is very important but it can't help you much when a value is provided as default. This, is against my tuition actually. Hopefully it helps new programmer when choose which vector initialization format.
It seems to depend on the compiler flags. Here is a benchmark code:
#include <chrono>
#include <cmath>
#include <ctime>
#include <iostream>
#include <vector>
int main(){
int size = 1000000; // reduce this number in case your program crashes
int L = 10;
std::cout << "size=" << size << " L=" << L << std::endl;
{
srand( time(0) );
double * data = new double[size];
double result = 0.;
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
for( int l = 0; l < L; l++ ) {
for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
for( int i = 0; i < size; i++ ) result += data[i] * data[i];
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
std::cout << "Calculation result is " << sqrt(result) << "\n";
std::cout << "Duration of C style heap array: " << duration << "ms\n";
delete data;
}
{
srand( 1 + time(0) );
double data[size]; // technically, non-compliant with C++ standard.
double result = 0.;
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
for( int l = 0; l < L; l++ ) {
for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
for( int i = 0; i < size; i++ ) result += data[i] * data[i];
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
std::cout << "Calculation result is " << sqrt(result) << "\n";
std::cout << "Duration of C99 style stack array: " << duration << "ms\n";
}
{
srand( 2 + time(0) );
std::vector<double> data( size );
double result = 0.;
std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
for( int l = 0; l < L; l++ ) {
for( int i = 0; i < size; i++ ) data[i] = rand() % 100;
for( int i = 0; i < size; i++ ) result += data[i] * data[i];
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
std::cout << "Calculation result is " << sqrt(result) << "\n";
std::cout << "Duration of std::vector array: " << duration << "ms\n";
}
return 0;
}
Different optimization flags give different answers:
$ g++ -O0 benchmark.cpp
$ ./a.out
size=1000000 L=10
Calculation result is 181182
Duration of C style heap array: 118441ms
Calculation result is 181240
Duration of C99 style stack array: 104920ms
Calculation result is 181210
Duration of std::vector array: 124477ms
$g++ -O3 benchmark.cpp
$ ./a.out
size=1000000 L=10
Calculation result is 181213
Duration of C style heap array: 107803ms
Calculation result is 181198
Duration of C99 style stack array: 87247ms
Calculation result is 181204
Duration of std::vector array: 89083ms
$ g++ -Ofast benchmark.cpp
$ ./a.out
size=1000000 L=10
Calculation result is 181164
Duration of C style heap array: 93530ms
Calculation result is 181179
Duration of C99 style stack array: 80620ms
Calculation result is 181191
Duration of std::vector array: 78830ms
Your exact results will vary but this is quite typical on my machine.
In my experience, sometimes, just sometimes, vector<int> can be many times slower than int[]. One thing to keep mind is that vectors of vectors are very unlike int[][]. As the elements are probably not contiguous in memory. This means you can resize different vectors inside of the main one, but CPU might not be able to cache elements as well as in the case of int[][].