I'm experiencing strange memory access performance problem, any ideas?
int* pixel_ptr = somewhereFromHeap;
int local_ptr[307200]; //local
//this is very slow
for(int i=0;i<307200;i++){
pixel_ptr[i] = someCalculatedVal ;
}
//this is very slow
for(int i=0;i<307200;i++){
pixel_ptr[i] = 1 ; //constant
}
//this is fast
for(int i=0;i<307200;i++){
int val = pixel_ptr[i];
local_ptr[i] = val;
}
//this is fast
for(int i=0;i<307200;i++){
local_ptr[i] = someCalculatedVal ;
}
Tried consolidating values to local scanline
int scanline[640]; // local
//this is very slow
for(int i=xMin;i<xMax;i++){
int screen_pos = sy*screen_width+i;
int val = scanline[i];
pixel_ptr[screen_pos] = val ;
}
//this is fast
for(int i=xMin;i<xMax;i++){
int screen_pos = sy*screen_width+i;
int val = scanline[i];
pixel_ptr[screen_pos] = 1 ; //constant
}
//this is fast
for(int i=xMin;i<xMax;i++){
int screen_pos = sy*screen_width+i;
int val = i; //or a constant
pixel_ptr[screen_pos] = val ;
}
//this is slow
for(int i=xMin;i<xMax;i++){
int screen_pos = sy*screen_width+i;
int val = scanline[0];
pixel_ptr[screen_pos] = val ;
}
Any ideas? I'm using mingw with cflags -01 -std=c++11 -fpermissive.
update4:
I have to say that these are snippets from my program and there are heavy code/functions running before and after. The scanline block did ran at the end of function before exit.
Now with proper test program. thks to #Iwillnotexist.
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#define SIZE 307200
#define SAMPLES 1000
double local_test(){
int local_array[SIZE];
timeval start, end;
long cpu_time_used_sec,cpu_time_used_usec;
double cpu_time_used;
gettimeofday(&start, NULL);
for(int i=0;i<SIZE;i++){
local_array[i] = i;
}
gettimeofday(&end, NULL);
cpu_time_used_sec = end.tv_sec- start.tv_sec;
cpu_time_used_usec = end.tv_usec- start.tv_usec;
cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;
return cpu_time_used;
}
double heap_test(){
int* heap_array=new int[SIZE];
timeval start, end;
long cpu_time_used_sec,cpu_time_used_usec;
double cpu_time_used;
gettimeofday(&start, NULL);
for(int i=0;i<SIZE;i++){
heap_array[i] = i;
}
gettimeofday(&end, NULL);
cpu_time_used_sec = end.tv_sec- start.tv_sec;
cpu_time_used_usec = end.tv_usec- start.tv_usec;
cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;
delete[] heap_array;
return cpu_time_used;
}
double heap_test2(){
static int* heap_array = NULL;
if(heap_array==NULL){
heap_array = new int[SIZE];
}
timeval start, end;
long cpu_time_used_sec,cpu_time_used_usec;
double cpu_time_used;
gettimeofday(&start, NULL);
for(int i=0;i<SIZE;i++){
heap_array[i] = i;
}
gettimeofday(&end, NULL);
cpu_time_used_sec = end.tv_sec- start.tv_sec;
cpu_time_used_usec = end.tv_usec- start.tv_usec;
cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;
return cpu_time_used;
}
int main (int argc, char** argv){
double cpu_time_used = 0;
for(int i=0;i<SAMPLES;i++)
cpu_time_used+=local_test();
printf("local: %f ms\n",cpu_time_used);
cpu_time_used = 0;
for(int i=0;i<SAMPLES;i++)
cpu_time_used+=heap_test();
printf("heap_: %f ms\n",cpu_time_used);
cpu_time_used = 0;
for(int i=0;i<SAMPLES;i++)
cpu_time_used+=heap_test2();
printf("heap2: %f ms\n",cpu_time_used);
}
Complied with no optimization.
local: 577.201000 ms
heap_: 826.802000 ms
heap2: 686.401000 ms
The first heap test with new and delete is 2x slower. (paging as suggested?)
The second heap with reused heap array is still 1.2x slower.
But I guess the second test is not that practical as there tend to other codes running before and after at least for my case. For my case, my pixel_ptr of course only allocated once during
prograim initialization.
But if anyone has solutions/idea to speeding things up please reply!
I'm still perplexed why heap write is so much slower than stack segment.
Surely there must be some tricks to make the heap more cpu/cache flavourable.
Final update?:
I revisited, the disassemblies again and this time, suddenly I have an idea why some of my breakpoints
don't activate. The program looks suspiciously shorter thus I suspect the complier might
have removed the redundant dummy code I put in which explains why the local array is magically many times faster.
I was a bit curious so I did the test, and indeed I could measure a difference between stack and heap access.
The first guess would be that the generated assembly is different, but after taking a look, it is actually identical for heap and stack (which makes sense, memory shouldn't be discriminated).
If the assembly is the same, then the difference must come from the paging mechanism. The guess is that on the stack, the pages are already allocated, but on the heap, first access cause a page fault and page allocation (invisible, it all happens at kernel level). To verify this, I did the same test, but first I would access the heap once before measuring. The test gave identical times for stack and heap. To be sure, I also did a test in which I first accessed the heap, but only every 4096 bytes (every 1024 int), then 8192, because a page is usually 4096 bytes long. The result is that accessing only every 4096 bytes also gives the same time for heap and stack, but accessing every 8192 gives a difference, but not as much as with no previous access at all. This is because only half of the pages were accessed and allocated beforehand.
So the answer is that on the stack, memory pages are already allocated, but on the heap, pages are allocated on-the-fly. This depends on the OS paging policy, but all major PC OSes probably have a similar one.
For all the tests I used Windows, with MS compiler targeting x64.
EDIT: For the test, I measured a single, larger loop, so there was only one access at each memory location. deleteing the array and measuring the same loop multiple time should give similar times for stack and heap, because deleteing memory probably don't de-allocate the pages, and they are already allocated for the next loop (if the next new allocated on the same space).
The following two code examples should not differ in runtime with a good compiler setting. Probably your compiler will generate the same code:
//this is fast
for(int i=0;i<307200;i++){
int val = pixel_ptr[i];
local_ptr[i] = val;
}
//this is fast
for(int i=0;i<307200;i++){
local_ptr[i] = pixel_ptr[i];
}
Please try to increase optimization setting.
Related
I have a sample C++ program as below:
#include <windows.h>
#include <stdio.h>
int main(int argc, char* argv[])
{
void * pointerArr[20000];
int i = 0, j;
for (i = 0; i < 20000; i++) {
void * pointer = malloc(131125);
if (pointer == NULL) {
printf("i = %d, out of memory!\n", i);
getchar();
break;
}
pointerArr[i] = pointer;
}
for (j = 0; j < i; j++) {
free(pointerArr[j]);
}
getchar();
return 0;
}
When I run it with Visual Studio 32-bit Debug, it will run with following result:
The program can use nearly 2Gb of memory before out of memory.
This is normal behavior.
However, when I adding the code to start Thread inside the for loop as below:
#include <windows.h>
#include <stdio.h>
DWORD WINAPI thread_func(VOID* pInArgs)
{
Sleep(100000);
return 0;
}
int main(int argc, char* argv[])
{
void * pointerArr[20000];
int i = 0, j;
for (i = 0; i < 20000; i++) {
CreateThread(NULL, 0, thread_func, NULL, 0, NULL);
void * pointer = malloc(131125);
if (pointer == NULL) {
printf("i = %d, out of memory!\n", i);
getchar();
break;
}
pointerArr[i] = pointer;
}
for (j = 0; j < i; j++) {
free(pointerArr[j]);
}
getchar();
return 0;
}
The result is as below:
The memory is still just around 200Mb but function malloc will return NULL.
Could anyone help explain why the program cannot use the memory up to 2Gb before out of memory?
Is it mean creating many threads like above will cause memory leak?
In my real application, this error occur when I create about 800 threads, the RAM memory at the time "out of memory" is around 300Mb.
As noted in a comment by #macroland, the main thing happening here is that each thread is consuming 1 MiB for its stack (see MSDN CreateThread and Thread Stack Size). You say malloc returns NULL once the total you have directly allocated reaches 200 MB. Since you are allocating 131125 bytes at a time, that is 200 MB / 131125 B = 1525 threads. Their cumulative stack space will be around 1.5 GB. Adding the 200 MB of malloc memory is 1.7 GB, and miscellaneous overhead likely accounts for the rest.
So, why does Task Manager not show this? Because the full 1 MiB of thread stack space is not actually allocated (also called committed), rather it is reserved. See VirtualAlloc and the MEM_RESERVE flag. The address space has been reserved for expansion up to 1 MiB, but initially only 64 KiB are allocated, and Task Manager only counts the latter. But reserved memory will not be unilaterally repurposed by malloc until the reservation is lifted, so once it runs out of available address space, it has to return NULL.
What tool can show this? I don't know of anything off the shelf (even Process Explorer does not seem show a count of reserved memory). What I have done in the past is write my own little routine that uses VirtualQuery to enumerate the entire address space, including reserved ranges. I recommend you do the same; it's not much code to write, and very handy when coding for 32-bit Windows because the 2 GiB address space gets cramped very easily (DLLs are an obvious reason, but the default malloc also will leave unexpected reservations behind in response to certain allocation patterns even if you free everything).
In any case, if you want to create thousands of threads in a 32-bit Windows process, be sure to pass a non-zero value as the dwStackSize parameter to CreateThread, and also pass STACK_SIZE_PARAM_IS_A_RESERVATION as dwCreationFlags. The minimum is 64 KiB, which will be plenty if you avoid recursive algorithms in the threads.
Addendum: In a comment, #iinspectable cautions against using thousands of threads, citing Raymond Chen's 2005 blog post Does Windows have a limit of 2000 threads per process?. I agree that doing so is questionable for a variety of reasons; it is not my intent to endorse the practice, rather I'm just explaining one necessary element.
This question already has answers here:
Performance: memset
(2 answers)
Why might std::vector be faster than a raw dynamically allocated array?
(2 answers)
Why is iterating though `std::vector` faster than iterating though `std::array`?
(2 answers)
Idiomatic way of performance evaluation?
(1 answer)
Closed 3 years ago.
When I run the following program (with optimization on), the for loop with the std::vector takes about 0.04 seconds while the for loop with the array takes 0.0001 seconds.
#include <iostream>
#include <vector>
#include <chrono>
int main()
{
int len = 800000;
int* Data = new int[len];
int arr[3] = { 255, 0, 0 };
std::vector<int> vec = { 255, 0, 0 };
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < len; i++) {
Data[i] = vec[0];
}
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "The vector took " << elapsed.count() << "seconds\n";
start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < len; i++) {
Data[i] = arr[0];
}
finish = std::chrono::high_resolution_clock::now();
elapsed = finish - start;
std::cout << "The array took " << elapsed.count() << "seconds \n";
char s;
std::cin >> s;
delete[] Data;
}
The code is a simplified version of a performance issue I was having while writing a raycaster. The len variable corresponds to how many times the loop in the original program needs to run (400 pixels * 400 pixels * 50 maximum render distance). For complicated reasons (perhaps that I don't fully understand how to use arrays) I have to use a vector rather than an array in the actual raycaster. However, as this program demonstrates, that would only give me 20 frames per second as opposed to the envied 10,000 frames per second that using an array would supposedly give me (obviously, this is just a simplified performance test). But regardless of how accurate those numbers are, I still want to boost my frame rate as much as possible. So, why is the vector performing so much slower than the array? Is there a way to speed it up? Thanks for your help. And if there's anything else I'm doing weirdly that might be affecting performance, please let me know. I didn't even know about optimization until researching an answer for this question, so if there are any more things like that which might boost the performance, please let me know (and I'd prefer if you explained where those settings are in the properties manager rather than command line since I don't yet know how to use the command line)
Let us observe how GCC optimizes this test program:
#include <vector>
int main()
{
int len = 800000;
int* Data = new int[len];
int arr[3] = { 255, 0, 0 };
std::vector<int> vec = { 255, 0, 0 };
for (int i = 0; i < len; i++) {
Data[i] = vec[0];
}
for (int i = 0; i < len; i++) {
Data[i] = arr[0];
}
delete[] Data;
}
The compiler rightly notices that the vector is constant, and eliminates it. Exactly same code is generated for both loops. Therefore it should be irrelevant whether the first loop uses array or vector.
.L2:
movups XMMWORD PTR [rcx], xmm0
add rcx, 16
cmp rsi, rcx
jne .L2
What makes difference in your test program is the order of loops. The comments point out that when a third loop is added to the beginning, both loops take the same time.
I would expect that with a modern compiler accessing a vector would be approximately as fast as accessing an array, when optimization is enabled and debug is disabled. If there is an observable difference in your actual program, the problem lies somewhere else.
It is about caches. I dont know how it works detailed but Data[] is getting known better by cpu while it is used. If you reverse the order of calculation you can see 'vector is faster'.
But actually, you are testing neither vector nor array. Let's assume that vec[0] resides at 0x01 memory location, arr[0] resides at 0xf1. Only difference is reading a word from different single memory adresses. So you are testing how fast can I assign a value to elements of dynamically allocated array.
Note: std::chrono::high_resolution_clock might not be sufficient to measure ticks. It is better to use steady_clock as cppreference says.
I have some program that allocate memory a lot, I hoped to boost it's speed by splitting task on threads, but it made my program only slower.
I made this minimal example that has nothing to do with my real code aside of the fact it allocate memory in different threads.
class ThreadStartInfo
{
public:
unsigned char *arr_of_5m_elems;
bool TaskDoneFlag;
ThreadStartInfo()
{
this->TaskDoneFlag = false;
this->arr_of_5m_elems = NULL;
}
~ThreadStartInfo()
{
if (this->arr_of_5m_elems)
free(this->arr_of_5m_elems);
}
};
unsigned long __stdcall CalcSomething(void *tsi_ptr)
{
ThreadStartInfo *tsi = (ThreadStartInfo*)tsi_ptr;
for (int i = 0; i < 5000000; i++)
{
double *test_ptr = (double*)malloc(tsi->arr_of_5m_elems[i] * sizeof(double));
memset(test_ptr, 0, tsi->arr_of_5m_elems[i] * sizeof(double));
free(test_ptr);
}
tsi->TaskDoneFlag = true;
return 0;
}
void main()
{
ThreadStartInfo *tsi1 = new ThreadStartInfo();
tsi1->arr_of_5m_elems = (unsigned char*)malloc(5000000 * sizeof(unsigned char));
ThreadStartInfo *tsi2 = new ThreadStartInfo();
tsi2->arr_of_5m_elems = (unsigned char*)malloc(5000000 * sizeof(unsigned char));
ThreadStartInfo **tsi_arr = (ThreadStartInfo**)malloc(2 * sizeof(ThreadStartInfo*));
tsi_arr[0] = tsi1;
tsi_arr[1] = tsi2;
time_t start_dt = time(NULL);
CalcSomething(tsi1);
CalcSomething(tsi2);
printf("Task done in %i seconds.\n", time(NULL) - start_dt);
//--
tsi1->TaskDoneFlag = false;
tsi2->TaskDoneFlag = false;
//--
start_dt = time(NULL);
unsigned long th1_id = 0;
void *th1h = CreateThread(NULL, 0, CalcSomething, tsi1, 0, &th1_id);
unsigned long th2_id = 0;
void *th2h = CreateThread(NULL, 0, CalcSomething, tsi2, 0, &th2_id);
retry:
for (int i = 0; i < 2; i++)
if (!tsi_arr[i]->TaskDoneFlag)
{
Sleep(100);
goto retry;
}
CloseHandle(th1h);
CloseHandle(th2h);
printf("MT Task done in %i seconds.\n", time(NULL) - start_dt);
}
It prints me such results:
Task done in 16 seconds.
MT Task done in 19 seconds.
And... I didn't expected slow down. Is there anyway to make memory allocations faster in multiple threads?
Apart from some undefined behavior due to lack of synchronization on TaskDoneFlag, all the threads are doing is calling malloc/free repeatedly.
The Visual C++ CRT heap is single-threaded1, as malloc/free delegate to HeapAlloc/HeapFree which execute in a critical section (only one thread at a time). Calling them from more than one thread at a time will never be faster than a single thread, and often slower due to the lock contention overhead.
Either reduce allocations in threads or switch to another memory allocator, like jemalloc or tcmalloc.
1 See this note for HeapAlloc:
Serialization ensures mutual exclusion when two or more threads attempt to simultaneously allocate or free blocks from the same heap. There is a small performance cost to serialization, but it must be used whenever multiple threads allocate and free memory from the same heap. Setting the HEAP_NO_SERIALIZE value eliminates mutual exclusion on the heap. Without serialization, two or more threads that use the same heap handle might attempt to allocate or free memory simultaneously, likely causing corruption in the heap.
This code results in x pointing to a chunk of memory 100GB in size.
#include <stdlib.h>
#include <stdio.h>
int main() {
auto x = malloc(1);
for (int i = 1; i< 1024; ++i) x = realloc(x, i*1024ULL*1024*100);
while (true); // Give us time to check top
}
While this code fails allocation.
#include <stdlib.h>
#include <stdio.h>
int main() {
auto x = malloc(1024ULL*1024*100*1024);
printf("%llu\n", x);
while (true); // Give us time to check top
}
My guess is, that the memory size of your system is less than the 100 GiB that you are trying to allocate. While Linux does overcommit memory, it still bails out of requests that are way beyond what it can fulfill. That is why the second example fails.
The many small increments of the first example, on the other hand, are way below that threshold. So each one of them succeeds as the kernel knows that you didn't require any of the prior memory yet, so it has no indication that it won't be able to back those 100 additional MiB.
I believe that the threshold for when a memory request from a process fails is relative to the available RAM, and that it can be adjusted (though I don't remember how exactly).
Well you're allocating less memory in the one that succeeds:
for (int i = 1; i< 1024; ++i) x = realloc(x, i*1024ULL*1024*100);
The last realloc is:
x = realloc(x, 1023 * (1024ULL*1024*100));
As compared to:
auto x = malloc(1024 * (1024ULL*100*1024));
Maybe that's right where your memory boundary is - the last 100M that broke the camel's back?
when using C++ vector, time spent is 718 milliseconds,
while when I use Array, time is almost 0 milliseconds.
Why so much performance difference?
int _tmain(int argc, _TCHAR* argv[])
{
const int size = 10000;
clock_t start, end;
start = clock();
vector<int> v(size*size);
for(int i = 0; i < size; i++)
{
for(int j = 0; j < size; j++)
{
v[i*size+j] = 1;
}
}
end = clock();
cout<< (end - start)
<<" milliseconds."<<endl; // 718 milliseconds
int f = 0;
start = clock();
int arr[size*size];
for(int i = 0; i < size; i++)
{
for(int j = 0; j < size; j++)
{
arr[i*size+j] = 1;
}
}
end = clock();
cout<< ( end - start)
<<" milliseconds."<<endl; // 0 milliseconds
return 0;
}
Your array arr is allocated on the stack, i.e., the compiler has calculated the necessary space at compile time. At the beginning of the method, the compiler will insert an assembler statement like
sub esp, 10000*10000*sizeof(int)
which means the stack pointer (esp) is decreased by 10000 * 10000 * sizeof(int) bytes to make room for an array of 100002 integers. This operation is almost instant.
The vector is heap allocated and heap allocation is much more expensive. When the vector allocates the required memory, it has to ask the operating system for a contiguous chunk of memory and the operating system will have to perform significant work to find this chunk of memory.
As Andreas says in the comments, all your time is spent in this line:
vector<int> v(size*size);
Accessing the vector inside the loop is just as fast as for the array.
For an additional overview see e.g.
[What and where are the stack and heap?
[http://computer.howstuffworks.com/c28.htm][2]
[http://www.cprogramming.com/tutorial/virtual_memory_and_heaps.html][3]
Edit:
After all the comments about performance optimizations and compiler settings, I did some measurements this morning. I had to set size=3000 so I did my measurements with roughly a tenth of the original entries. All measurements performed on a 2.66 GHz Xeon:
With debug settings in Visual Studio 2008 (no optimization, runtime checks, and debug runtime) the vector test took 920 ms compared to 0 ms for the array test.
98,48 % of the total time was spent in vector::operator[], i.e., the time was indeed spent on the runtime checks.
With full optimization, the vector test needed 56 ms (with a tenth of the original number of entries) compared to 0 ms for the array.
The vector ctor required 61,72 % of the total application running time.
So I guess everybody is right depending on the compiler settings used. The OP's timing suggests an optimized build or an STL without runtime checks.
As always, the morale is: profile first, optimize second.
If you are compiling this with a Microsoft compiler, to make it a fair comparison you need to switch off iterator security checks and iterator debugging, by defining _SECURE_SCL=0 and _HAS_ITERATOR_DEBUGGING=0.
Secondly, the constructor you are using initialises each vector value with zero, and you are not memsetting the array to zero before filling it. So you are traversing the vector twice.
Try:
vector<int> v;
v.reserve(size*size);
To get a fair comparison I think something like the following should be suitable:
#include <sys/time.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include <numeric>
int main()
{
static size_t const size = 7e6;
timeval start, end;
int sum;
gettimeofday(&start, 0);
{
std::vector<int> v(size, 1);
sum = std::accumulate(v.begin(), v.end(), 0);
}
gettimeofday(&end, 0);
std::cout << "= vector =" << std::endl
<< "(" << end.tv_sec - start.tv_sec
<< " s, " << end.tv_usec - start.tv_usec
<< " us)" << std::endl
<< "sum = " << sum << std::endl << std::endl;
gettimeofday(&start, 0);
int * const arr = new int[size];
std::fill(arr, arr + size, 1);
sum = std::accumulate(arr, arr + size, 0);
delete [] arr;
gettimeofday(&end, 0);
std::cout << "= Simple array =" << std::endl
<< "(" << end.tv_sec - start.tv_sec
<< " s, " << end.tv_usec - start.tv_usec
<< " us)" << std::endl
<< "sum = " << sum << std::endl << std::endl;
}
In both cases, dynamic allocation and deallocation is performed, as well as accesses to elements.
On my Linux box:
$ g++ -O2 foo.cpp
$ ./a.out
= vector =
(0 s, 21085 us)
sum = 7000000
= Simple array =
(0 s, 21148 us)
sum = 7000000
Both the std::vector<> and array cases have comparable performance. The point is that std::vector<> can be just as fast as a simple array if your code is structured appropriately.
On a related note switching off optimization makes a huge difference in this case:
$ g++ foo.cpp
$ ./a.out
= vector =
(0 s, 120357 us)
sum = 7000000
= Simple array =
(0 s, 60569 us)
sum = 7000000
Many of the optimization assertions made by folks like Neil and jalf are entirely correct.
HTH!
EDIT: Corrected code to force vector destruction to be included in time measurement.
Change assignment to eg. arr[i*size+j] = i*j, or some other non-constant expression. I think compiler optimizes away whole loop, as assigned values are never used, or replaces array with some precalculated values, so that loop isn't even executed and you get 0 milliseconds.
Having changed 1 to i*j, i get the same timings for both vector and array, unless pass -O1 flag to gcc, then in both cases I get 0 milliseconds.
So, first of all, double-check whether your loops are actually executed.
You are probably using VC++, in which case by default standard library components perform many checks at run-time (e.g whether index is in range). These checks can be turned off by defining some macros as 0 (I think _SECURE_SCL).
Another thing is that I can't even run your code as is: the automatic array is way too large for the stack. When I make it global, then with MingW 3.5 the times I get are 627 ms for the vector and 26875 ms (!!) for the array, which indicates there are really big problems with an array of this size.
As to this particular operation (filling with value 1), you could use the vector's constructor:
std::vector<int> v(size * size, 1);
and the fill algorithm for the array:
std::fill(arr, arr + size * size, 1);
Two things. One, operator[] is much slower for vector. Two, vector in most implementations will behave weird at times when you add in one element at a time. I don't mean just that it allocates more memory but it does some genuinely bizarre things at times.
The first one is the main issue. For a mere million bytes, even reallocating the memory a dozen times should not take long (it won't do it on every added element).
In my experiments, preallocating doesn't change its slowness much. When the contents are actual objects it basically grinds to a halt if you try to do something simple like sort it.
Conclusion, don't use stl or mfc vectors for anything large or computation heavy. They are implemented poorly/slowly and cause lots of memory fragmentation.
When you declare the array, it lives in the stack (or in static memory zone), which it's very fast, but can't increase its size.
When you declare the vector, it assign dynamic memory, which it's not so fast, but is more flexible in the memory allocation, so you can change the size and not dimension it to the maximum size.
When profiling code, make sure you are comparing similar things.
vector<int> v(size*size);
initializes each element in the vector,
int arr[size*size];
doesn't. Try
int arr[size * size];
memset( arr, 0, size * size );
and measure again...