I'm trying to understand why I am getting std::bad_alloc exceptions when I seem to have enough (virtual?) memory available to me.
Essentially I have a prime number generator (Eratosthenes sieve (not segmented yet)) where I'm newing bools for an indicator array, and then newing ints for the primes I've found under a bound I specify on the command line.
I have 1GB RAM (some of this will be hogged by my OS (ubuntu 10.04), and probably some of it is not available as heap memory (am I wrong here?)) and 2.8 GB of swap space (I believe that was auto set for me when installing Ubuntu)
If I set an upper bound of 600000000 then I'm asking for 0.6 GB of memory for my indicator array and roughly 30000000*4 bytes (slight over estimate given there are 26355867 primes less than 500000000) for my primes array, and a few variables here and there; this means I'm asking for about .72 (+ negligible) GB of memory which I believe should be covered by the swap space available to me (I am aware touching that stuff will slow my program down ridiculously). However I am getting std::bad_allocs.
Could anyone point out what I'm missing here? (one last thing having changed long long ints to ints before pasting my last error was a seg fault (my numbers are way below 2^31 though so I can't see where I'm overflowing) - still trying to figure that one out)
My code is as follows (and without taking away from me the benefit of my own investigation into quicker algorithms etc.. I'd be appreciative of any code improvements here! (i.e. if I'm committing major no-no s))
main.cpp
#include <iostream>
#include <cmath>
#include "Prime.hpp"
#include <ctime>
#include <stdio.h>
#include <cstring>
//USAGE: execute program with the nth prime you want and an upper bound for finding primes --too high may cause bad alloc
int main(int argc, const char *argv[])
{
int a = strlen(argv[1]);
clock_t start = clock();
if(argc != 2)
{
std::cout << "USAGE: Enter a positive inputNumber <= 500000000.\n"
<< "This inputNumber is an upper bound for the primes that can be found\n";
return -1;
}
const char* primeBound = argv[1];
int inputNum = 0;
for(int i = 0; i < strlen(argv[1]); i++)
{
if(primeBound[i] < 48 || primeBound[i] > 57 || primeBound[0] == 48)
{
std::cout << "USAGE: Enter a positive inputNumber <= 500000000.\n"
<< "This inputNumber is an upper bound for the primes that can be found\n";
return -1;
}
inputNum = (int)(primeBound[i]-48) + (10 * inputNum);
}
if(inputNum > 600000000)//getting close to the memory limit for this machine (1GB - memory used by the OS):
//(each bool takes 1 byte and I'd be asking for more than 500 million of these
//and I'd also asking for over 100000000 bytes to store the primes > 0.6 GB)
{
std::cout << "USAGE: Enter a positive inputNumber <= 500000000.\n"
<< "This inputNumber is an upper bound for the primes that can be found\n";
return -1;
}
Prime p(inputNum);
std::cout << "the largest prime less than " << inputNum << " is: " << p.getPrime(p.getNoOfPrimes()) << "\n";
std::cout << "Number of primes: " << p.getNoOfPrimes() << "\n";
std::cout << ((double)clock() - start) / CLOCKS_PER_SEC << "\n";
return 0;
}
Prime.hpp
#ifndef PRIME_HPP
#define PRIME_HPP
#include <iostream>
#include <cmath>
class Prime
{
int lastStorageSize;
bool* primeIndicators;
int* primes;
int noOfPrimes;
void allocateIndicatorArray(int num);
void allocatePrimesArray();
void generateIndicators();
void generatePrimeList();
Prime(){}; //forcing constructor with param = size
public:
Prime(int num);
int getNoOfPrimes();
int getPrime(int nthPrime);
~Prime(){delete [] primeIndicators; delete [] primes;}
};
#endif
Prime.cpp
#include "Prime.hpp"
#include <iostream>
//don't know how much memory I will need so allocate on the heap
void Prime::allocateIndicatorArray(int num)
{
try
{
primeIndicators = new bool[num];
}
catch(std::bad_alloc ba)
{
std::cout << "not enough memory :[";
//if I'm looking for a particular prime I might have over-allocated here anyway...might be worth
//decreasing num and trying again - if this is possible!
}
lastStorageSize = num;
}
void Prime::allocatePrimesArray()
{
//could probably speed up generateIndicators() if, using some prime number theory, I slightly over allocate here
//since that would cut down the operations dramatically (a small procedure done many times made smaller)
try
{
primes = new int[lastStorageSize];
}
catch(std::bad_alloc ba)
{
std::cout << "not enough memory :[";
//if I'm looking for a particular prime I might have over-allocated here anyway...might be worth
//decreasing num and trying again - if this is possible!
}
}
void Prime::generateIndicators()
{
//first identify the primes -- if we see a 0 then start flipping all elements that are multiples of i starting from i*i (these will not be prime)
int numPrimes = lastStorageSize - 2; //we'll be starting at i = 2 (so numPrimes is at least 2 less than lastStorageSize)
for(int i=4; i < lastStorageSize; i+=2)
{
primeIndicators[i]++; //dispense with all the even numbers (barring 2 - that one's prime)
numPrimes--;
}
//TODO here I'm multiple counting the same things...not cool >;[
//may cost too much to avoid this wastage unfortunately
for(int i=3; i < sqrt(double(lastStorageSize)); i+=2) //we start j at i*i hence the square root
{
if(primeIndicators[i] == 0)
{
for(int j = i*i; j < lastStorageSize; j = j+(2*i)) //note: i is prime, and we'll have already sieved any j < i*i
{
if(primeIndicators[j] == 0)
{
numPrimes--;//we are not checking each element uniquely yet :/
primeIndicators[j]=1;
}
}
}
}
noOfPrimes = numPrimes;
}
void Prime::generatePrimeList()
{
//now we go and get the primes, i.e. wherever we see zero in primeIndicators[] then populate primes with the value of i
int primesCount = 0;
for(int i=2;i<lastStorageSize; i++)
{
if(primeIndicators[i] == 0)
{
if(i%1000000 = 0)
std::cout << i << " ";
primes[primesCount]=i;
primesCount++;
}
}
}
Prime::Prime(int num)
{
noOfPrimes = 0;
allocateIndicatorArray(num);
generateIndicators();
allocatePrimesArray();
generatePrimeList();
}
int Prime::getPrime(int nthPrime)
{
if(nthPrime < lastStorageSize)
{
return primes[nthPrime-1];
}
else
{
std::cout << "insufficient primes found\n";
return -1;
}
}
int Prime::getNoOfPrimes()
{
return noOfPrimes;
}
Whilst I'm reading around has anybody got any insight on this?
edit For some reason I decided to start newing my primes list with lastStorageSize ints instead of noOfPrime! thanks to David Fischer for spotting that one!
I can now exceed 600000000 as an upper bound
The amount of memory you can use inside your program is limited by the lesser of the two: 1) the available virtual memory, 2) the available address space.
If you are compiling your program as a 32-bit executable on a platform with flat memory model, the absolute limit of addressable space for a single process is 4GB. In this situation it is completely irrelevant how much swap space you have available. You simply can't allocate more than 4GB in a flat-memory 32-bit program, even if you still have lots of free swap space. Moreover, a large chunk of those 4GB of available addresses will be reserved for system needs.
On such a 32-bit platform allocating a large amount of swap space does make sense, since it will let you run multiple processes at once. But it does nothing to overcome the 4GB address space barrier for each specific process.
Basically, think of it as a phone number availability problem: if some region uses 7-digit phone numbers, then once you run out of the available 7-digit phone numbers in that region, manufacturing more phones for that region no longer makes any sense - they won't be usable. By adding swap space you essentially "manufacturing phones". But you have already run out of available "phone numbers".
The same issue formally exists, of course, with flat-memory model 64-bit platforms. However, the address space of 64-bit platform is so huge, that it is no longer a bottleneck (you know, "64-bit should be enough for everyone" :) )
When you allocate the sieve,
void Prime::allocateIndicatorArray(int num)
{
try
{
primeIndicators = new bool[num];
}
catch(std::bad_alloc ba)
{
std::cout << "not enough memory :[";
}
lastStorageSize = num;
}
you set lastStorageSize to num, the given bound for the primes. Then you never change it, and
void Prime::allocatePrimesArray()
{
try
{
primes = new int[lastStorageSize];
}
catch(std::bad_alloc ba)
{
std::cout << "not enough memory :[";
}
}
try to allocate an int array of lastStorageSize elements.
If num is around 500 million, that's around 2 GB that you request. Depending on operating system/overcommitting strategy, that can easily cause a bad_alloc even though you only need a fraction of the space actually.
After the sieving is finished, you set noOfPrimes to the count of found primes - use that number to allocate the primes array.
Since the memory usage of the program is so easy to analyze, just let the memory layout be completely fixed. Don't dynamically allocate anything. Use std::bitset to get a fixed-size bitvector, and make that a global variable.
std::bitset< 600000000 > indicators; // 75 MB
This won't take up space on disk. The OS will just allocate pages of zeroes as you progress along the array. And it makes better use of each bit.
Of course, half the bits represent even numbers, despite there being only one even prime. Here are a couple prime generators that optimize out such things.
By the way, it's better to avoid explicitly writing new if possible, avoid calling functions from the constructor, and to rethrow the std::bad_alloc to avoid allowing the object to be constructed into an invalid state.
The first question is "what other processes are running?" The
2.87 GB of swap space is shared between all of the running
processes; it is not per process. And frankly, on a modern
system, 2.8 GB sounds fairly low to me. I wouldn't try to run
recent versions of Windows or Linux with less than 2GB ram and
4GB swap. (Recent versions of Linux, at least in the Ubuntu
distribution, especially, seem to start up a lot of daemons
which hog the memory.) You might want to try top, sorted on
virtual memory size, just to see how much other processes are
taking.
cat /proc/meminfo will also give you a lot of valuable
information about what is actually being used. (On my system,
running just a couple of xterm with bash, plus Firefox, I
have only 3623776 kB free, on a system with 8GB. Some of the
memory counted as used is probably things like disk caching,
which the system can scale back if an application requests
memory.)
Second, concerning your seg faults: by default, Linux doesn't
always report allways report allocation failures correctly; it
will often lie, telling you that you have the memory, when you
don't. Try cat /proc/sys/vm/overcommit_memory. If it
displays zero, then you need to change it. If this is the case,
try echo 2 > /proc/sys/vm/overcommit_memory (and do this in
one of the rc files). You may have to change the
/proc/sys/vm/overcommit_ratio as well to get reliable behavior
from sbrk (which both malloc and operator new depend on).
Related
I was looking to see how many elements I can stick into a vector before the program crashes. When running the code below the program crashed with a bad alloc at i=90811045, aka when trying to add the 90811045th element. My question is: Why 90811045?
it is:
not a power of two
not the value that vector.max_size() gives
the same number both in debug and release
the same number after restarting my computer
the same number regardless of what the value of the long long is
note: I know I can fix this by using vector.reserve() or other methods, I am just interested in where 90811045 comes from.
code used:
#include <iostream>
#include <vector>
int main() {
std::vector<long long> myLongs;
std::cout << "Max size expected : " << myLongs.max_size() << std::endl;
for (int i = 0; i < 160000000; i++) {
myLongs.push_back(i);
if (i % 10000 == 0) {
std::cout << "Still going! : " << i << " \r";
}
}
return 0;
}
extra info:
I am currently using 64 bit windows with 16 GB of ram.
Why 90811045?
It's probably just incidental.
That vector is not the only thing that uses memory in your process. There is the execution stack where local variables are stored. There is memory allocated by for buffering the input and output streams. Furthermore, the global memory allocator uses some of the memory for bookkeeping.
90811044 were added succesfully. The vector implementation (typically) has a deterministic strategy for allocating larger internal buffer. Typically, it multiplies the previous capacity by a constant factor (greater than 1). Hence, we can conclude that 90811044 * sizeof(long long) + other_usage is consistently small enough to be allocated successfully, but (90811044 * sizeof(long long)) * some_factor + other_usage is consistently too much.
My question is related to a problem described here. I have written a C++ implementation of the Sieve of Eratosthenes that hits a memory overflow if I set the target value too high. As suggested in that question, I am able to fix the problem by using a boolean <vector> instead of a normal array.
However, I am hitting the memory overflow at a much lower value than expected, around n = 1 200 000. The discussion in the thread linked above suggests that the normal C++ boolean array uses a byte for each entry, so with 2 GB of RAM, I expect to be able to get to somewhere on the order of n = 2 000 000 000. Why is the practical memory limit so much smaller?
And why does using <vector>, which encodes the booleans as bits instead of bytes, yield more than an eightfold increase in the computable limit?
Here is a working example of my code, with n set to a small value.
#include <iostream>
#include <cmath>
#include <vector>
using namespace std;
int main() {
// Count and sum of primes below target
const int target = 100000;
// Code I want to use:
bool is_idx_prime[target];
for (unsigned int i = 0; i < target; i++) {
// initialize by assuming prime
is_idx_prime[i] = true;
}
// But doesn't work for target larger than ~1200000
// Have to use this instead
// vector <bool> is_idx_prime(target, true);
for (unsigned int i = 2; i < sqrt(target); i++) {
// All multiples of i * i are nonprime
// If i itself is nonprime, no need to check
if (is_idx_prime[i]) {
for (int j = i; i * j < target; j++) {
is_idx_prime[i * j] = 0;
}
}
}
// 0 and 1 are nonprime by definition
is_idx_prime[0] = 0; is_idx_prime[1] = 0;
unsigned long long int total = 0;
unsigned int count = 0;
for (int i = 0; i < target; i++) {
// cout << "\n" << i << ": " << is_idx_prime[i];
if (is_idx_prime[i]) {
total += i;
count++;
}
}
cout << "\nCount: " << count;
cout << "\nTotal: " << total;
return 0;
}
outputs
Count: 9592
Total: 454396537
C:\Users\[...].exe (process 1004) exited with code 0.
Press any key to close this window . . .
Or, changing n = 1 200 000 yields
C:\Users\[...].exe (process 3144) exited with code -1073741571.
Press any key to close this window . . .
I am using the Microsoft Visual Studio interpreter on Windows with the default settings.
Turning the comment into a full answer:
Your operating system reserves a special section in the memory to represent the call stack of your program. Each function call pushes a new stack frame onto the stack. If the function returns, the stack frame is removed from the stack. The stack frame includes the memory for the parameters to your function and the local variables of the function. The remaining memory is referred to as the heap. On the heap, arbitrary memory allocations can be made, whereas the structure of the stack is governed by the control flow of your program. A limited amount of memory is reserved for the stack, when it gets full (e.g. due to too many nested function calls or due to too large local objects), you get a stack overflow. For this reason, large objects should be allocated on the heap.
General references on stack/heap: Link, Link
To allocate memory on the heap in C++, you can:
Use vector<bool> is_idx_prime(target);, which internally does a heap allocation and deallocates the memory for you when the vector goes out of scope. This is the most convenient way.
Use a smart pointer to manage the allocation: auto is_idx_prime = std::make_unique<bool[]>(target); This will also automatically deallocate the memory when the array goes out of scope.
Allocate the memory manually. I am mentioning this only for educational purposes. As mentioned by Paul in the comments, doing a manual memory allocation is generally not advisable, because you have to manually deallocate the memory again. If you have a large program with many memory allocations, inevitably you will forget to free some allocation, creating a memory leak. When you have a long-running program, such as a system service, creating repeated memory leaks will eventually fill up the entire memory (and speaking from personal experience, this absolutely does happen in practice). But in theory, if you would want to make a manual memory allocation, you would use bool *is_idx_prime = new bool[target]; and then later deallocate again with delete [] is_idx_prime.
Problem Statement is to find prime number below 2 billion in timeframe < 20 sec.
I followed below approaches.
Divide the number n with list of number k ( k < sqrt(n)) - took 20 sec
Divide the number n with list of prime number below sqrt(n).In this scenario I stored prime numbers in std::list - took more than 180 sec
Can someone help me understand why did 2nd approach take longtime even though we reduced no of divisions by 50%(approx)? or Did I choose wrong Data Structure?
Approach 1:
#include <iostream>
#include<list>
#include <ctime>
using namespace std;
list<long long> primeno;
void ListPrimeNumber();
int main()
{
clock_t time_req = clock();
ListPrimeNumber();
time_req = clock() - time_req;
cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
return 0;
}
void check_prime(int i);
void ListPrimeNumber()
{
primeno.push_back(2);
primeno.push_back(3);
primeno.push_back(5);
for (long long i = 6; i <= 20000000; i++)
{
check_prime(i);
}
}
void check_prime(int i)
{
try
{
int j = 0;
int limit = sqrt(i);
for (j = 2 ; j <= limit;j++)
{
if(i % j == 0)
{
break;
}
}
if( j > limit)
{
primeno.push_back(i);
}
}
catch (exception ex)
{
std::cout << "Message";
}
}
Approach 2 :
#include <iostream>
#include<list>
#include <ctime>
using namespace std;
list<long long> primeno;
int noofdiv = 0;
void ListPrimeNumber();
int main()
{
clock_t time_req = clock();
ListPrimeNumber();
time_req = clock() - time_req;
cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
cout << "No of divisions : " << noofdiv;
return 0;
}
void check_prime(int i);
void ListPrimeNumber()
{
primeno.push_back(2);
primeno.push_back(3);
primeno.push_back(5);
for (long long i = 6; i <= 10000; i++)
{
check_prime(i);
}
}
void check_prime(int i)
{
try
{
int limit = sqrt(i);
for (int iter : primeno)
{
noofdiv++;
if (iter <= limit && (i%iter) == 0)
{
break;
}
else if (iter > limit)
{
primeno.push_back(i);
break;
}
}
}
catch (exception ex)
{
std::cout << "Message";
}
}
The reason your second example takes longer is you're iterating a std::list.
A std::list in C++ is a linked list, which means it doesn't use contiguous memory. This is bad because to iterate the list you must jump from node to node in a (to the CPU/prefetcher) unpredictable way. Also, You're most likely only "using" a few bytes of each cacheline. RAM is slow. Fetching a byte from RAM takes a lot longer than fetching it from L1. CPUs are fast these days, so your program is most of the time not doing anything and waiting for memory to arrive.
Use a std::vector instead. It stores all values one after the other and iterating is very cheap. Since you're iterating forward in memory without jumping, you're using the full cacheline and your prefetcher will be able to fetch further pages before you need them because your access of memory is predictable.
It has been proven by numerous people, including Bjarne Stroustrup, that std::vector is in a lot of cases faster than std::list, even in cases where the std::list has "theoretically" better complexity (random insert, delete, ...) just because caching helps a lot. So always use std::vector as your default. And if you think a linked list would be faster in your case, measure it and be surprised that - most of the time - std::vector dominates.
Edit: as others have noted, your method of finding primes isn't very efficient. I just played around a bit and implemented a Sieve of Eratosthenes using a bitset.
constexpr int max_prime = 1000000000;
std::bitset<max_prime> *bitset = new std::bitset<max_prime>{};
// Note: Bit SET means NO prime
bitset->set(0);
bitset->set(1);
for(int i = 4; i < max_prime ; i += 2)
bitset->set(i); // set all even numbers
int max = sqrt(max_prime);
for(int i = 3; i < max; i += 2) { // No point testing even numbers as they can't be prime
if(!bitset->test(i)) { // If i is prime
for(int j = i * 2; j < no_primes; j += i)
bitset->set(j); // set all multiples of i to non-prime
}
}
This takes between 4.2 and 4.5 seconds 30 seconds (not sure why it changed that much after slight modifications... must be an optimization I'm not hitting anymore) to find all primes below one Billion (1,000,000,000) on my machine. Your approach took way too long even for 100 million. I cancelled the 1 Billion search after about two minutes.
Comparison for 100 million:
time taken: 63.515 seconds
time taken bitset: 1.874 seconds
No of divisions : 1975961174
No of primes found: 5761455
No of primes found bitset: 5761455
I'm not a mathematician so I'm pretty sure there are still ways to optimize it further, I only optimize for even numbers.
The first thing to do is make sure you are compiling with optimisations enabled. The c++ standard library template classes tend to perform very poorly with unoptimised code as they generate lots of function calls. The optimiser inlines most of these function calls which makes them much cheaper.
std::list is a linked list. Its is mostly useful where you want to insert or remove elements randomly (i.e. not from the end).
For the case where you are only appending to the end of a list std::list has the following issues:
Iterating through the list is relatively expensive as the code has to follow node pointers and then retrieve the data
The list uses quite a lot more memory, each element needs a pointer to the previous and next nodes in addition to the actual data. On a 64-bit system this equates to 20 bytes per element rather than 4 for a list of int
As the elements in the list are not contiguous in memory the compiler can't perform as many SIMD optimisations and you will suffer more from CPU cache misses
A std::vector would solve all of the above as its memory is contiguous and iterating through it is basically just a case of incrementing an array index. You do need to make sure that you call reserve on your vector at the beginning with a sufficiently large value so that appending to the vector doesn't cause the whole array to be copied to a new larger array.
A bigger optimisation than the above would be to use the Sieve of Eratosthenes to calculate your primes. As generating this light require random deletions (depending on your exact implementation) std::list might perform better than std::vector though even in this case the overheads of std::list might not outweigh its costs.
A test at Ideone (the OP code with few superficial alterations) completely contradicts the claims made in this question:
/* check_prime__list:
time taken No of divisions No of primes
10M: 0.873 seconds 286144936 664579
20M: 2.169 seconds 721544444 1270607 */
2B: projected time: at least 16 minutes but likely much more (*)
/* check_prime__nums:
time taken No of divisions No of primes
10M: 4.650 seconds 1746210131 664579
20M: 12.585 seconds 4677014576 1270607 */
2B: projected time: at least 3 hours but likely much more (*)
I also changed the type of the number of divisions counter to long int because it was wrapping around the data type limit. So they could have been misinterpreting that.
But the run time wasn't being affected by that. A wall clock is a wall clock.
Most likely explanation seems to be a sloppy testing by the OP, with different values used in each test case, by mistake.
(*) The time projection was made by the empirical orders of growth analysis:
100**1.32 * 2.169 / 60 = 15.8
100**1.45 * 12.585 / 3600 = 2.8
Empirical orders of growth, as measured on the given range of sizes, were noticeably better for the list algorithm, n1.32 vs. the n1.45 for the testing by all numbers. This is entirely expected from theoretical complexity, since there are fewer primes than all numbers up to n, by a factor of log n, for a total complexity of O(n1.5/log n) vs. O(n1.5). It is also highly unlikely for any implementational discrepancy to beat an actual algorithmic advantage.
I'm a student, and my Operating Systems class project has a little snag, which is admittedly a bit superfluous to the assignment specifications itself:
While I can push 1 million deques into my deque of deques, I cannot push ~10 million or more.
Now, in the actual program, there is lots of stuff going on, and the only thing already asked on Stack Overflow with even the slightest relevance had exactly that, only slight relevance. https://stackoverflow.com/a/11308962/3407808
Since that answer had focused on "other functions corrupting the heap", I isolated the code into a new project and ran that separately, and found everything to fail in exactly the same ways.
Here's the code itself, stripped down and renamed for the sake of space.
#include <iostream>
#include <string>
#include <sstream>
#include <deque>
using namespace std;
class cat
{
cat();
};
bool number_range(int lower, int upper, double value)
{
while(true)
{
if(value >= lower && value <= upper)
{
return true;
}
else
{
cin.clear();
cerr << "Value not between " << lower << " and " << upper << ".\n";
return false;
}
}
}
double get_double(char *message, int lower, int upper)
{
double out;
string in;
while(true) {
cout << message << " ";
getline(cin,in);
stringstream ss(in); //convert input to stream for conversion to double
if(ss >> out && !(ss >> in))
{
if (number_range(lower, upper, out))
{
return out;
}
}
//(ss >> out) checks for valid conversion to double
//!(ss >> in) checks for unconverted input and rejects it
cin.clear();
cerr << "Value not between " << lower << " and " << upper << ".\n";
}
}
int main()
{
int dq_amount = 0;
deque<deque <cat> > dq_array;
deque<cat> dq;
do {
dq_amount = get_double("INPUT # OF DEQUES: ", 0, 99999999);
for (int i = 0; i < number_of_printers; i++)
{
dq_array.push_back(dq);
}
} while (!number_range(0, 99999999, dq_amount));
}
In case that's a little obfuscated, the design (just in case it's related to the error) is that my program asks for you to input an integer value. It takes your input and verifies that it can be read as an integer, and then further parses it to ensure it is within certain numerical bounds. Once it's found within bounds, I push deques of myClass into a deque of deques of myClass, for the amount of times equal to the user's input.
This code has been working for the past few weeks that I've being making this project, but my upper bound had always been 9999, and I decided to standardize it with most of the other inputs in my program, which is an appreciably large 99,999,999. Trying to run this code with 9999 as the user input works fine, even with 99999999 as the upper bound. The issue is a runtime error that happens if the user input is 9999999+.
Is there any particular, clear reason why this doesn't work?
Oh, right, the error message itself from Code::Blocks 13.12:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's supprt team for more information.
Process returned 3 (0x3) execution time : 12.559 s
Press any key to continue.
I had screenshots, but I need to be 10+ reputation in order to put images into my questions.
This looks like address space exhaustion.
If you are compiling for a 32-bit target, you will generally be limited to 2 GiB of user-mode accessible address space per process, or maybe 3 GiB on some platforms. (The remainder is reserved for kernel-mode mappings shared between processes)
If you are running on a 64-bit platform and build a 64-bit binary, you should be able to do substantially more new/alloc() calls, but be advised you may start hitting swap.
Alternatively, you might be hitting a resource quota even if you are building a 64-bit binary. On Linux you can check ulimit -d to see if you have a per-process memory limit.
when using C++ vector, time spent is 718 milliseconds,
while when I use Array, time is almost 0 milliseconds.
Why so much performance difference?
int _tmain(int argc, _TCHAR* argv[])
{
const int size = 10000;
clock_t start, end;
start = clock();
vector<int> v(size*size);
for(int i = 0; i < size; i++)
{
for(int j = 0; j < size; j++)
{
v[i*size+j] = 1;
}
}
end = clock();
cout<< (end - start)
<<" milliseconds."<<endl; // 718 milliseconds
int f = 0;
start = clock();
int arr[size*size];
for(int i = 0; i < size; i++)
{
for(int j = 0; j < size; j++)
{
arr[i*size+j] = 1;
}
}
end = clock();
cout<< ( end - start)
<<" milliseconds."<<endl; // 0 milliseconds
return 0;
}
Your array arr is allocated on the stack, i.e., the compiler has calculated the necessary space at compile time. At the beginning of the method, the compiler will insert an assembler statement like
sub esp, 10000*10000*sizeof(int)
which means the stack pointer (esp) is decreased by 10000 * 10000 * sizeof(int) bytes to make room for an array of 100002 integers. This operation is almost instant.
The vector is heap allocated and heap allocation is much more expensive. When the vector allocates the required memory, it has to ask the operating system for a contiguous chunk of memory and the operating system will have to perform significant work to find this chunk of memory.
As Andreas says in the comments, all your time is spent in this line:
vector<int> v(size*size);
Accessing the vector inside the loop is just as fast as for the array.
For an additional overview see e.g.
[What and where are the stack and heap?
[http://computer.howstuffworks.com/c28.htm][2]
[http://www.cprogramming.com/tutorial/virtual_memory_and_heaps.html][3]
Edit:
After all the comments about performance optimizations and compiler settings, I did some measurements this morning. I had to set size=3000 so I did my measurements with roughly a tenth of the original entries. All measurements performed on a 2.66 GHz Xeon:
With debug settings in Visual Studio 2008 (no optimization, runtime checks, and debug runtime) the vector test took 920 ms compared to 0 ms for the array test.
98,48 % of the total time was spent in vector::operator[], i.e., the time was indeed spent on the runtime checks.
With full optimization, the vector test needed 56 ms (with a tenth of the original number of entries) compared to 0 ms for the array.
The vector ctor required 61,72 % of the total application running time.
So I guess everybody is right depending on the compiler settings used. The OP's timing suggests an optimized build or an STL without runtime checks.
As always, the morale is: profile first, optimize second.
If you are compiling this with a Microsoft compiler, to make it a fair comparison you need to switch off iterator security checks and iterator debugging, by defining _SECURE_SCL=0 and _HAS_ITERATOR_DEBUGGING=0.
Secondly, the constructor you are using initialises each vector value with zero, and you are not memsetting the array to zero before filling it. So you are traversing the vector twice.
Try:
vector<int> v;
v.reserve(size*size);
To get a fair comparison I think something like the following should be suitable:
#include <sys/time.h>
#include <vector>
#include <iostream>
#include <algorithm>
#include <numeric>
int main()
{
static size_t const size = 7e6;
timeval start, end;
int sum;
gettimeofday(&start, 0);
{
std::vector<int> v(size, 1);
sum = std::accumulate(v.begin(), v.end(), 0);
}
gettimeofday(&end, 0);
std::cout << "= vector =" << std::endl
<< "(" << end.tv_sec - start.tv_sec
<< " s, " << end.tv_usec - start.tv_usec
<< " us)" << std::endl
<< "sum = " << sum << std::endl << std::endl;
gettimeofday(&start, 0);
int * const arr = new int[size];
std::fill(arr, arr + size, 1);
sum = std::accumulate(arr, arr + size, 0);
delete [] arr;
gettimeofday(&end, 0);
std::cout << "= Simple array =" << std::endl
<< "(" << end.tv_sec - start.tv_sec
<< " s, " << end.tv_usec - start.tv_usec
<< " us)" << std::endl
<< "sum = " << sum << std::endl << std::endl;
}
In both cases, dynamic allocation and deallocation is performed, as well as accesses to elements.
On my Linux box:
$ g++ -O2 foo.cpp
$ ./a.out
= vector =
(0 s, 21085 us)
sum = 7000000
= Simple array =
(0 s, 21148 us)
sum = 7000000
Both the std::vector<> and array cases have comparable performance. The point is that std::vector<> can be just as fast as a simple array if your code is structured appropriately.
On a related note switching off optimization makes a huge difference in this case:
$ g++ foo.cpp
$ ./a.out
= vector =
(0 s, 120357 us)
sum = 7000000
= Simple array =
(0 s, 60569 us)
sum = 7000000
Many of the optimization assertions made by folks like Neil and jalf are entirely correct.
HTH!
EDIT: Corrected code to force vector destruction to be included in time measurement.
Change assignment to eg. arr[i*size+j] = i*j, or some other non-constant expression. I think compiler optimizes away whole loop, as assigned values are never used, or replaces array with some precalculated values, so that loop isn't even executed and you get 0 milliseconds.
Having changed 1 to i*j, i get the same timings for both vector and array, unless pass -O1 flag to gcc, then in both cases I get 0 milliseconds.
So, first of all, double-check whether your loops are actually executed.
You are probably using VC++, in which case by default standard library components perform many checks at run-time (e.g whether index is in range). These checks can be turned off by defining some macros as 0 (I think _SECURE_SCL).
Another thing is that I can't even run your code as is: the automatic array is way too large for the stack. When I make it global, then with MingW 3.5 the times I get are 627 ms for the vector and 26875 ms (!!) for the array, which indicates there are really big problems with an array of this size.
As to this particular operation (filling with value 1), you could use the vector's constructor:
std::vector<int> v(size * size, 1);
and the fill algorithm for the array:
std::fill(arr, arr + size * size, 1);
Two things. One, operator[] is much slower for vector. Two, vector in most implementations will behave weird at times when you add in one element at a time. I don't mean just that it allocates more memory but it does some genuinely bizarre things at times.
The first one is the main issue. For a mere million bytes, even reallocating the memory a dozen times should not take long (it won't do it on every added element).
In my experiments, preallocating doesn't change its slowness much. When the contents are actual objects it basically grinds to a halt if you try to do something simple like sort it.
Conclusion, don't use stl or mfc vectors for anything large or computation heavy. They are implemented poorly/slowly and cause lots of memory fragmentation.
When you declare the array, it lives in the stack (or in static memory zone), which it's very fast, but can't increase its size.
When you declare the vector, it assign dynamic memory, which it's not so fast, but is more flexible in the memory allocation, so you can change the size and not dimension it to the maximum size.
When profiling code, make sure you are comparing similar things.
vector<int> v(size*size);
initializes each element in the vector,
int arr[size*size];
doesn't. Try
int arr[size * size];
memset( arr, 0, size * size );
and measure again...