2D vector with very large size - C++

I have defined a two-dimensional vector in C++ whose dimensions are very large. The vector definition is like this:
vector<vector<string> > CommunityNodes(3600, vector<string>(240005));
When I run the program, I get the following error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Before I run the program, I run the following command in the console, which sets the stack size to unlimited:
ulimit -s unlimited
But I still get the allocation error. How can I define such a big vector in C++?

Let's assume an implementation of std::string containing one 32-bit pointer to its contents, one 32-bit int for the current length of the string, and one 32-bit int for the current allocation size.
That gives us 12 bytes per string * 240005 strings per row * 3600 rows, which works out to 9.7 gigabytes -- considerably more than you can deal with on a 32-bit implementation. Worse, an implementation might easily pad that 12-byte string out to 16 bytes, thus increasing the memory needed still more.
If you go to a 64-bit implementation, you can address more memory, but the size is likely to double, so you'd need roughly 20 gigabytes of memory just to store the arrays of empty strings. Add some actual contents, and you need even more (and, again, the string could easily be padded to be larger still).
So, yes, with a 64-bit implementation this can probably be made to work, but it's impractical for most purposes. You probably want/need to find some way to reduce the number of strings you use.
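To see what the baseline cost is on your own toolchain before committing to this layout, a quick check along these lines may help (a sketch only; the per-string size it prints is implementation-defined, and the dimensions are taken from the question):
#include <cstdint>
#include <iostream>
#include <string>

int main() {
    // Memory needed for the empty string objects alone, with no contents.
    const std::uint64_t rows = 3600;
    const std::uint64_t cols = 240005;
    const std::uint64_t bytes = rows * cols * sizeof(std::string);
    std::cout << "per empty string: " << sizeof(std::string) << " bytes\n"
              << "whole table, no contents: "
              << bytes / (1024.0 * 1024 * 1024) << " GiB\n";
}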

An attempt to pre-allocate that much space suggests that you are taking the wrong approach to the problem. I suspect that most of the nodes in the double vector are going to be empty. If that is the case, you would be much better off with something like a double map instead of a double vector:
std::map< int, std::map< int, std::string > > CommunityNodes;
Then you only create the entries you must have, instead of preallocating the entire array.
Please Note: using the '[]' operator will automatically create a node, even if you don't assign anything to it. If you don't want to create the node, then use 'find()' instead of '[]'.
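A short sketch of how that sparse layout might be used (the indices and values are made up for illustration):
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<int, std::map<int, std::string>> CommunityNodes;

    // operator[] creates the (row, column) entry on first access.
    CommunityNodes[17][120004] = "node-a";

    // Read-only lookup without creating entries: use find().
    auto row = CommunityNodes.find(17);
    if (row != CommunityNodes.end()) {
        auto cell = row->second.find(120004);
        if (cell != row->second.end())
            std::cout << cell->second << '\n';
    }
}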

Related

C++ creating huge vector

For a process I'm trying to run I need to have a std::vector of std::tuple<long unsigned int, long unsigned int>. The test I'm doing right now should create a vector of 47,614,527,250 (around 47 billion) tuples but actually crashes right there on creation with the error terminate called after throwing an instance of 'std::bad_alloc'. My goal is to use this script with a vector roughly twice that size. The code is this:
arc_vector = std::vector<std::tuple<long unsigned int, long unsigned int>>(arcs);
where arcs is a long unsigned int with the cited value.
Can I, and in that case how do I, increase the memory size? This script is running on a 40-core machine with something like 200GB of memory so I know memory itself is not an issue.
47 billion tuples times 16 bytes per tuple is roughly 762 billion bytes, which is about 760 GB. Your machine has less than 1/3 of the memory required for that, so you really need another approach, regardless of the reason your program crashes.
A proposal I can give you is to use a memory-mapped file of 1 TB to store that array, and if you really need a vector as the interface you might write a custom allocator for it that uses the mapped memory. That should work around your lack of main memory in a quasi-transparent way. If your interface requires a standard vector with the standard allocator, you are better off re-designing it.
Another point: check the ulimit values for the user running the process, because there might be a stricter virtual-memory limit than 760 GB.
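One possible shape for such an allocator, as a sketch only: it assumes a POSIX system, and the class name, the /tmp scratch-file pattern, and the demo size are all illustrative rather than a tested drop-in solution.
#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>
#include <cstddef>
#include <new>
#include <tuple>
#include <vector>

template <typename T>
struct MmapFileAllocator {
    using value_type = T;

    MmapFileAllocator() = default;
    template <typename U>
    MmapFileAllocator(const MmapFileAllocator<U>&) {}

    T* allocate(std::size_t n) {
        const std::size_t bytes = n * sizeof(T);
        char path[] = "/tmp/arcbuf-XXXXXX";   // one scratch file per allocation
        int fd = ::mkstemp(path);
        if (fd < 0) throw std::bad_alloc();
        ::unlink(path);                       // file space is reclaimed once unmapped
        if (::ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
            ::close(fd);
            throw std::bad_alloc();
        }
        void* p = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        ::close(fd);                          // the mapping outlives the descriptor
        if (p == MAP_FAILED) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t n) { ::munmap(p, n * sizeof(T)); }
};

template <typename T, typename U>
bool operator==(const MmapFileAllocator<T>&, const MmapFileAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const MmapFileAllocator<T>&, const MmapFileAllocator<U>&) { return false; }

using Arc = std::tuple<unsigned long, unsigned long>;

int main() {
    // Demo size only; with enough disk behind the scratch file this can exceed
    // RAM, at the price of heavy paging once the working set outgrows memory.
    std::vector<Arc, MmapFileAllocator<Arc>> arc_vector(1000000);
    return arc_vector.size() == 1000000 ? 0 : 1;
}
Note that every reallocation maps a fresh scratch file, so reserving or sizing the vector once up front avoids repeated large mappings and copies.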
You may well have a machine with a lot of memory but the problem is that you require that memory to be contiguous.
Even with memory virtualisation, that's unlikely.
For that amount of data, you'll need to use a different storage container. You could roll your own based on a linked list of vectors that subdivide the data, a vector of pointers to subdivided vectors of your tuples, or find a library that has such a construction already built.
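A rough sketch of the second idea, a vector of pointers to fixed-size sub-vectors, with the chunk size and names chosen arbitrarily for illustration:
#include <cstddef>
#include <cstdint>
#include <memory>
#include <tuple>
#include <vector>

using Arc = std::tuple<std::uint64_t, std::uint64_t>;

class ChunkedArcs {
    static constexpr std::size_t kChunk = 1 << 20;   // 1M arcs per chunk
    std::vector<std::unique_ptr<std::vector<Arc>>> chunks_;
public:
    void push_back(const Arc& a) {
        if (chunks_.empty() || chunks_.back()->size() == kChunk)
            chunks_.push_back(std::make_unique<std::vector<Arc>>());
        chunks_.back()->push_back(a);
    }
    Arc& operator[](std::size_t i) {                 // index across chunks
        return (*chunks_[i / kChunk])[i % kChunk];
    }
    std::size_t size() const {
        return chunks_.empty() ? 0
             : (chunks_.size() - 1) * kChunk + chunks_.back()->size();
    }
};

int main() {
    ChunkedArcs arcs;
    for (std::uint64_t i = 0; i < 3000000; ++i)      // small demo, not 47 billion
        arcs.push_back(Arc{i, i + 1});
    return arcs.size() == 3000000 ? 0 : 1;
}
std::deque already provides similar segmented storage out of the box, if you do not need control over the chunk size.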

Limit on vectors in C++

I have a question regarding vectors in C++. I know that, unlike arrays, there is no fixed limit on vectors. I have a graph with 6 million vertices and I am using a vector of a class. When I try to insert nodes into the vector it fails with a bad memory allocation, whereas it works perfectly with 2 million nodes. I know a bad allocation can mean it is failing because of the pointers I am using in my code, but that does not seem to be the case to me. My question: is it possible that it is failing because the large size of the graph exceeds some limit on the vector? If so, is there any way to increase that limit?
First of all you should verify how much memory a single element requires. What is the size of one vertex/node? (You can verify that by using the sizeof operator). Consider that if the answer is, say, 50 bytes, you need 50 bytes times 6 million vertices = 300 MBytes.
Then, consider the next problem: in a vector the memory must be contiguous. This means your program will ask the OS for a contiguous chunk of 300 MB, and there is no guarantee such a chunk is available even if the total free memory is more than 300 MB. You might have to split your data, or choose another, non-contiguous container. Address-space fragmentation is hard to control, which means the same program might work on one run and fail to allocate on the next (or vice versa).
Another possible approach is to manage the vector's capacity manually, instead of letting it grow automatically. When a vector has to grow, it tries to anticipate future growth and allocates more capacity than is strictly needed; that extra capacity might be the difference between having enough memory and not having it. You can use std::vector::reserve for this, though the exact behaviour is implementation-dependent: it may still set aside more than the amount you requested.
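For example (Vertex here is just a stand-in for your node class, and the element count matches the question):
#include <vector>

struct Vertex {                       // placeholder for your real node type
    int id;
    std::vector<int> neighbours;
};

int main() {
    std::vector<Vertex> graph;
    // One up-front request for 6 million slots; if this throws bad_alloc
    // you know immediately that the contiguous block is unavailable,
    // before any parsing work has been done.
    graph.reserve(6000000);
}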
One more option you have is to optimize the data types you are using. For example, if inside your vertex class you are using 32-bit integers while you only need 16 bits, you could use int16_t, which takes half the space. See the full list of fixed-width integer types on cppreference.
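A hypothetical illustration (these field names are invented, not taken from the question):
#include <cstdint>

// Using 16-bit fields where the value range allows it shrinks each record.
struct CompactVertex {
    std::int32_t id;        // still needs the full 32-bit range
    std::int16_t degree;    // assumed to fit in 16 bits
    std::int16_t colour;    // likewise
};
static_assert(sizeof(CompactVertex) == 8, "packs into 8 bytes per vertex");

int main() { return 0; }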
There is std::vector::max_size that you can use to see the maximum number of elements the vector you declared can potentially hold:
Return maximum size
Returns the maximum number of elements that the vector can hold.
This is the maximum potential size the container can reach due to known system or library implementation limitations, but the container is by no means guaranteed to be able to reach that size: it can still fail to allocate storage at any point before that size is reached.
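A quick way to inspect that number on your own implementation (it is only a theoretical ceiling, as the quote says):
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    // Typically an enormous value; real allocations fail much earlier.
    std::cout << "max_size: " << v.max_size() << '\n';
}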

Freeze in C++ program using huge vector

I have an issue with a C++ program. I think it's a problem of memory.
In my program I create some enormous std::vectors (I use reserve to allocate memory). With a vector size of 1,000,000 it is fine, but if I increase this number (to about ten million), my program freezes my PC and I can do nothing except wait for a crash (or for the end of the program if I'm lucky). My vector contains a structure called Point which contains a vector of double.
I used valgrind to check whether there is a memory leak, but according to it there is no problem. Maybe using a vector of objects is not advisable? Or maybe there are some system parameters to check? Or simply, is the vector too big for the computer?
What do you think about this?
Disclaimer
Note that this answer assumes a few things about your machine; the exact memory usage and error potential depend on your environment. And of course it is even easier to crash when you compute not on 2d points but on, e.g., 4d points (common in computer graphics), or even larger points for other numeric purposes.
About your problem
That's quite a lot of memory to allocate:
#include <iostream>
#include <vector>

struct Point {
    std::vector<double> coords;
};

int main() {
    std::cout << sizeof(Point) << std::endl;
}
On a typical 32-bit implementation this prints 12, which is the size in bytes of an empty Point (a 64-bit implementation usually reports 24). If you have 2-dimensional points, add another 2*sizeof(double) = 16 bytes of heap-allocated storage per element, i.e. you now have a total of roughly 28 bytes per Point.
With tens of millions of elements, you request hundreds of millions of bytes of data, e.g. for 20 million elements you request around 560 million bytes. While this does not exceed the maximum index into a std::vector, it is possible that the OS does not have that much contiguous memory free for you.
Also, your vector's memory needs to be copied quite often in order to grow. This happens for example on push_back: while the vector reallocates, the old buffer and the new, larger one (typically 1.5x to 2x the old capacity) exist at the same time, so a vector that already occupies hundreds of MiB can temporarily need far more than that, et cetera.
Optimizations (high level; preferred)
Do you actually need to store all the data all the time? Can you use a similar algorithm which does not require so much storage? Can you refactor your code so that storage is reduced? Can you write some data out to disk when you know it will be a while before you need it again?
Optimizations (low level)
If you know the number of elements before creating your outer vector, use the std::vector constructor that takes an initial size:
std::vector<Foo> foo(12); // construct with 12 value-initialized elements
Of course you can optimize a lot for memory; e.g. if you know you only ever have 2d points, just store two doubles as members: roughly 28 bytes -> 16 bytes. When you do not really need the precision of double, use float: 16 bytes -> 8 bytes. That cuts the per-element cost to less than a third:
// struct Point { std::vector<double> coords; }; <-- old
struct Point { float x, y; }; // <-- new
If this is still not enough, an ad-hoc solution could be std::deque, or another non-contiguous container: no temporary memory "doubling", because no wholesale reallocation is needed; also no need for the OS to find you such a large contiguous block of memory.
You can also use compression mechanisms, or indexed data, or fixed point numbers. But it depends on your exact circumstances.
struct Point { signed char x, y; }; // <-- or even this? pick an appropriate type
struct Point { short x_index, y_index; };
Without seeing your code, this is just speculation, but I suspect it is in large part due to your attempt to allocate a massive amount of memory that is contiguous. std::vector is guaranteed to be in contiguous memory, so if you try to allocate a large amount of space, the operating system has to try to find a block of memory that large that it can use. This may not be a problem for 2MB, but if you are suddenly trying to allocate 200MB or 2GB of contiguous memory ...
Additionally, any time you add a new element to the vector and it is forced to resize, all of the existing elements must be copied into the newly allocated space. If you have 9 million elements and adding the 9,000,001st element requires a resize, that is 9 million elements that have to be moved. As your vector gets larger, this copy takes longer.
Try using std::deque instead. It basically allocates fixed-size blocks (each internally contiguous), but each block can be placed wherever it fits.
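A minimal sketch of that swap, reusing the two-float Point layout suggested earlier (the element count is arbitrary):
#include <deque>

struct Point { float x, y; };

int main() {
    std::deque<Point> points;                 // grows block by block, no single
    for (int i = 0; i < 20000000; ++i)        // huge contiguous region required
        points.push_back(Point{static_cast<float>(i), 0.0f});
    return points.size() == 20000000 ? 0 : 1;
}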

200 million items in a vector

Here is my structure
struct Node
{
    int chrN;
    long int pos;
    int nbp;
    string Ref;
    string Alt;
};
To fill the structure I read through a file, parse the variables of interest into the structure, and then push it back into a vector. The problem is that there are around 200 million items and I have to keep all of them in memory (for the further steps)! But the program terminated after pushing back 50 million nodes with a bad_alloc error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Searching around gave me the idea that I'm out of memory, but the output of top showed 48% (at the moment of termination).
Additional information which may be useful:
I set the stack limit to unlimited, and
I'm using Ubuntu (x86_64 GNU/Linux) with 4 GB of RAM.
Any help would be most welcome.
Update:
I first switched from vector to list, then stored each ~500 MB chunk in a file and indexed the files for further analysis.
Vector storage is contiguous; in this case, 200 million times the size of the struct in bytes is required. For each of the strings in the struct, another small heap allocation may be needed to hold the string's contents. All together, this is not going to fit in your available address space, and no (non-compressing) data structure is going to solve that.
Vectors usually grow their backing capacity exponentially (which amortizes the cost of push_back). So when your program was already using about half the available address space, the vector probably attempted to double its size (or add 50%); that allocation failed with bad_alloc before the previous buffer was freed, which is why the reported memory use was only 48%.
That node structure consumes up to 44 bytes, plus the actual string buffers. There's no way 200 million of them will fit in 4 GB.
You need to not hold your entire dataset in memory at once.
Since vectors must store all elements in contiguous memory, you are much more likely to run out of memory before you consume your full available RAM.
Try using a std::list and see if that can retain all of your items. You won't be able to randomly access the elements, but that is a tradeoff you are most likely going to have to deal with.
The std::list can better utilize free fragments of RAM, since unlike the vector, it doesn't try to store the elements adjacent to one another.
Depending on the size of the data types, I'd guess the structure you are using is at least 4 + 4 + 4 + 4 + 4 = 20 bytes long on a 32-bit build (the last two 4s being the string objects themselves), and larger on 64-bit. If you have 200,000,000 of these records, that is around 3.8 GB of data. I'm not sure what you read from top, but this is close to your memory limit.
As LatencyMachine noted, the items have to be in a contiguous memory block, which will be difficult (the string contents can live somewhere else, but the roughly 20 bytes I summed up have to be in the vector itself).
It might help to initialize the vector with the correct size to avoid reallocation.
If you have a look at this code:
#include <iostream>
using namespace std;
struct Node
{
int chrN;
long int pos;
int nbp;
string Ref;
string Alt;
};
int main() {
// your code goes here
cout << sizeof(Node) << endl;
return 0;
}
The result it produces on ideone shows that the size of your structure, even with empty strings and on a 32-bit machine, is 20 bytes. Thus 200 * 10^6 elements times this size makes exactly 4 GB. You can't hope to have the whole memory just for you, so your program will churn through virtual memory like crazy. You have to think of a way to store the elements only partially, or your program will be in huge trouble.

Declaring a large character array in C++

I am trying right now to declare a large character array. I am using the character array as a bitmap (as in a map of booleans, not the image file type). The following code generates a compilation error.
//This is code before main. I want these as globals.
unsigned const long bitmap_size = (ULONG_MAX/(sizeof(char)));
char bitmap[bitmap_size];
The error is overflow in array dimension. I recognize that I'm trying to have my process consume a lot of data and that there might be some limit in place that prevents me from doing so. I am curious as to whether I am making a syntax error or if I need to request more resources from the kernel. Also, I have no interest in creating a bitmap with some class. Thank you for your time.
EDIT
ULONG_MAX is very much dependent upon the machine that you are using. On the particular machine I was compiling my code on it was well over 4.2 billion. All in all, I wouldn't use that constant this way, at least not for the purpose of memory allocation.
ULONG_MAX/sizeof(char) is the same as ULONG_MAX, which is a very large number. So large, in fact, that you don't have room for it even in virtual memory (because ULONG_MAX is probably the number of bytes in your entire virtual memory).
You definitely need to rethink what you are trying to do.
It's impossible to declare an array that large on most systems -- on a 32-bit system, that array is 4 GB, which doesn't fit into the available address space, and on most 64-bit systems, it's 16 exabytes (16 million terabytes), which doesn't fit into the available address space there either (and, incidentally, may be more memory than exists on the entire planet).
Use malloc() to allocate large amounts of memory. But be realistic. :)
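For instance (the 1 GiB figure below is an arbitrary illustration of "realistic", not a recommendation):
#include <cstdio>
#include <cstdlib>

int main() {
    const unsigned long bitmap_size = 1UL << 30;   // 1 GiB, not ULONG_MAX
    unsigned char* bitmap = static_cast<unsigned char*>(std::malloc(bitmap_size));
    if (bitmap == nullptr) {                       // heap allocation can fail; check it
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    // ... use the bitmap ...
    std::free(bitmap);
    return 0;
}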
As I understand it, the maximum size of an array in C++ is bounded by the largest value the platform's size type can represent, and in practice by much less. It is likely that your long-typed bitmap_size constant exceeds that limit.