C++ creating huge vector

For a process I'm trying to run I need a std::vector of std::tuple<long unsigned int, long unsigned int>. The test I'm doing right now should create a vector of 47,614,527,250 (around 47 billion) tuples, but it actually crashes right there on creation with the error terminate called after throwing an instance of 'std::bad_alloc'. My goal is to use this program with a vector roughly twice that size. The code is this:
arc_vector = std::vector<std::tuple<long unsigned int, long unsigned int>>(arcs);
where arcs is a long unsigned int with the cited value.
Can I increase the available memory, and if so, how? This program is running on a 40-core machine with something like 200 GB of memory, so I know memory itself is not an issue.

47 billion tuples at 16 bytes per tuple comes to about 762 billion bytes, roughly 760 GB (710 GiB). Your machine has less than a third of the memory required for that, so you really need another approach, regardless of the exact reason your program crashes.
A proposal I can give you: use a memory-mapped file of about 1 TB to store that array, and if you really need a vector as the interface, write a custom allocator that hands out the mapped memory. That should work around the lack of main memory in a quasi-transparent way. If your interface requires a standard vector with the standard allocator, you are better off redesigning it.
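A minimal sketch of what such an allocator could look like on a POSIX system (this is illustrative only: the scratch-file location, the one-file-per-allocation policy, and the minimal error handling are assumptions, not part of the answer above):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>
#include <tuple>
#include <vector>

// File-backed allocator: every allocation gets its own scratch file, so the
// data is paged to and from disk instead of having to fit entirely in RAM.
// Put the scratch files on a disk-backed filesystem (on many systems /tmp is
// RAM-backed, which would defeat the purpose).
template <typename T>
struct mmap_file_allocator {
    using value_type = T;

    mmap_file_allocator() = default;
    template <typename U>
    mmap_file_allocator(const mmap_file_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        const std::size_t bytes = n * sizeof(T);
        std::string path = "/var/tmp/arc_vector_XXXXXX";   // illustrative path
        int fd = ::mkstemp(&path[0]);                      // unique file per allocation
        if (fd < 0) throw std::bad_alloc();
        ::unlink(path.c_str());                            // file disappears once unmapped
        if (::ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
            ::close(fd);
            throw std::bad_alloc();
        }
        void* p = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        ::close(fd);                                       // the mapping keeps the file alive
        if (p == MAP_FAILED) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t n) noexcept { ::munmap(p, n * sizeof(T)); }
};

template <typename T, typename U>
bool operator==(const mmap_file_allocator<T>&, const mmap_file_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const mmap_file_allocator<T>&, const mmap_file_allocator<U>&) { return false; }

using arc = std::tuple<long unsigned int, long unsigned int>;
// std::vector<arc, mmap_file_allocator<arc>> arc_vector(arcs);  // disk-backed instead of RAM-only

Keep in mind that every element access may now touch disk, so expect a substantial slowdown compared to real memory.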
Another point: check the ulimit values for the user running the process, because the virtual-memory limit may be stricter than the roughly 760 GB this allocation needs.

You may well have a machine with a lot of memory, but the problem is that you are asking for all of that memory as one contiguous block.
Even with virtual memory, getting a single block that large is unlikely.
For that amount of data, you'll need a different storage container. You could roll your own based on a linked list of vectors that subdivide the data, use a vector of pointers to sub-vectors of your tuples, or find a library that already provides such a construction.

Related

Limit on vectors in c++

I have a question regarding vectors in C++. I know that, unlike arrays, vectors have no fixed limit. I have a graph with 6 million vertices and I am using a vector of a class. When I try to insert nodes into the vector it fails with a bad memory allocation, whereas it works perfectly with 2 million nodes. I thought bad_alloc meant it was failing because of the pointers I use in my code, but that does not seem to be the case here. My question: is it possible that it is failing due to the large size of the graph, i.e. that some limit on the vector is exceeded? If so, is there any way to increase that limit?
First of all you should verify how much memory a single element requires. What is the size of one vertex/node? (You can verify that by using the sizeof operator). Consider that if the answer is, say, 50 bytes, you need 50 bytes times 6 million vertices = 300 MBytes.
Then, consider the next problem: in a vector the memory must be contiguous. This means your program will ask the OS for one contiguous chunk of 300 MB, and there is no guarantee such a chunk is available even if the total free memory is more than 300 MB. You might have to split your data, or choose another, non-contiguous container. Address-space fragmentation is not something you can control, which means the program may work on one run and fail on another (or vice versa).
Another possible approach is to size the vector manually instead of letting it choose its new size automatically. When a vector has to grow it anticipates further growth, so it allocates more capacity than is immediately needed, and that extra capacity might be the difference between having enough memory and not having it. You can use std::vector::reserve for this, though the exact behaviour is implementation-dependent: it may still reserve more than the amount you requested.
One more option is to optimize the data types you are using. For example, if your vertex class uses 32-bit integers where 16 bits would do, you could use int16_t, which takes half the space. See the full list of fixed-width integer types on cppreference.
There is also std::vector::max_size, which you can use to see the maximum number of elements the vector you declared can potentially hold.
Return maximum size

Returns the maximum number of elements that the vector can hold.

This is the maximum potential size the container can reach due to known system or library implementation limitations, but the container is by no means guaranteed to be able to reach that size: it can still fail to allocate storage at any point before that size is reached.
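As a concrete illustration of both points (checking the element size, and reserving the final size up front so the vector never has to grow), with a stand-in Node type since the real vertex class isn't shown here:

#include <iostream>
#include <vector>

struct Node {                       // stand-in for your vertex/node class
    int id;
    std::vector<int> neighbours;
};

int main() {
    std::vector<Node> graph;
    std::cout << "bytes per Node:  " << sizeof(Node) << '\n';
    std::cout << "theoretical max: " << graph.max_size() << " elements\n";

    // Reserve the final size once, so the vector never has to grow and
    // temporarily hold both the old and the new buffer.
    graph.reserve(6000000);
    std::cout << "capacity now:    " << graph.capacity() << '\n';
}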

2D vector with very large size

I have defined a 2-dimensional vector in C++ whose sizes are very large. The vector definition is like this:
vector<vector<string> > CommunityNodes(3600, vector<string>(240005));
When I run the program, I get the following error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Before I run the program, I issue the following command in the console to make the stack size unlimited:
ulimit -s unlimited
But I still get the allocation error. How can I define such a big vector in C++?
Let's assume an implementation of std::string containing one 32-bit pointer to its contents, one 32-bit int for the current length of the string, and one 32-bit int for the current allocation size.
That gives us 12 bytes per string * 240005 strings per row * 3600 rows, which works out to 9.7 gigabytes -- considerably more than you can deal with on a 32-bit implementation. Worse, an implementation might easily pad that 12-byte string out to 16 bytes, thus increasing the memory needed still more.
If you go to a 64-bit implementation, you can address more memory, but the size is likely to double, so you'd need roughly 20 gigabytes of memory just to store the arrays of empty strings. Add some actual contents, and you need even more (and, again, the string could easily be padded to be larger still).
So, yes, with a 64-bit implementation this can probably be made to work, but it's impractical for most purposes. You probably want/need to find some way to reduce the number of strings you use.
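If you want to check the per-string overhead on your own implementation before settling on a design, a quick probe like the following shows where the estimate comes from (the total is a lower bound that ignores the string contents and allocator overhead):

#include <cstddef>
#include <iostream>
#include <string>

int main() {
    const std::size_t rows = 3600, cols = 240005;
    std::cout << "sizeof(std::string): " << sizeof(std::string) << " bytes\n";

    // Memory for the grid of empty strings alone, in GiB.
    double gib = double(sizeof(std::string)) * rows * cols
               / (1024.0 * 1024.0 * 1024.0);
    std::cout << "empty grid needs at least " << gib << " GiB\n";
}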
An attempt to pre-allocate that much space suggests that you are taking the wrong approach to the problem. I suspect that most of the nodes in the double vector are going to be empty. If that is the case, you would be much better off with something like a double map instead of a double vector:
std::map< int, std::map< int, std::string > > CommunityNodes;  // note: no parentheses, or this declares a function
Then you only create the entries you must have, instead of preallocating the entire array.
Please Note: using the '[]' operator will automatically create a node, even if you don't assign anything to it. If you don't want to create the node, then use 'find()' instead of '[]'.
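A short illustration of that difference (the keys and contents are arbitrary):

#include <iostream>
#include <map>
#include <string>

int main() {
    std::map< int, std::map< int, std::string > > CommunityNodes;

    CommunityNodes[5][17] = "node data";   // operator[] creates rows/cells as needed

    // find() only searches; unlike operator[], it does not create an entry.
    if (CommunityNodes.find(6) == CommunityNodes.end())
        std::cout << "row 6 does not exist and was not created\n";

    std::cout << "rows stored: " << CommunityNodes.size() << '\n';   // prints 1
}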

Freeze in C++ program using huge vector

I have an issue with a C++ program. I think it's a memory problem.
In my program I create some enormous std::vectors (I use reserve to allocate some memory). With a vector size of 1,000,000 it's OK, but if I increase this number (to about ten million), my program freezes my PC and I can do nothing except wait for a crash (or for the end of the program, if I'm lucky). My vector contains a structure called Point, which contains a vector of doubles.
I used valgrind to check for a memory leak, and according to it there is no problem. Maybe using a vector of objects is not advisable? Or is there some system parameter to check? Or is the vector simply too big for the computer?
What do you think about this?
Disclaimer
Note that this answer assumes a few things about your machine; the exact memory usage and error potential depend on your environment. And of course it is even easier to crash when you work not with 2d points but with, say, 4d points (common in computer graphics, for example), or even larger points for other numeric purposes.
About your problem
That's quite a lot of memory to allocate:
#include <iostream>
#include <vector>

struct Point {
    std::vector<double> coords;
};

int main () {
    std::cout << sizeof(Point) << std::endl;
}
This prints 12 on a typical 32-bit implementation (24 on a typical 64-bit one): that is just the bookkeeping of the empty inner vector. The coordinates themselves live in a separate heap allocation, so 2-dimensional points add another 2*sizeof(double) = 16 bytes per element, for a total of at least 28 bytes per Point, not counting per-allocation heap overhead.
With tens of millions of elements you are requesting hundreds of millions of bytes of data; for 20 million elements, that is at least 560 million bytes. While this does not exceed the maximum index of an std::vector, it is possible that the OS cannot hand you that much contiguous memory.
Also, the vector's memory has to be copied whenever it grows. This happens, for example, on push_back: once the vector occupies several hundred MiB, the next growth step needs the old block and the newly allocated, larger block at the same time, so peak memory use can temporarily approach double the vector's size, et cetera.
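You can watch that growth behaviour directly by printing capacity() while appending elements (the exact growth factor depends on your standard library):

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> v;
    std::size_t last_capacity = 0;

    for (std::size_t i = 0; i < 1000000; ++i) {
        v.push_back(0.0);
        if (v.capacity() != last_capacity) {      // a reallocation just happened:
            last_capacity = v.capacity();         // old and new buffers briefly
            std::cout << "size " << v.size()      // coexisted in memory
                      << " -> capacity " << last_capacity << '\n';
        }
    }
}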
Optimizations (high level; preferred)
Do you actually need to store all the data at all times? Can you use a similar algorithm that does not require as much storage? Can you refactor your code so that less storage is needed? Can you page some data out to disk when you know it will be a while until you need it again?
Optimizations (low level)
If you know the number of elements before creating your outer vector, use the std::vector constructor that takes an initial size:
std::vector<Foo> foo(12); // construct foo with 12 value-initialized elements
Of course you can optimize a lot for memory; e.g. if you know you only ever have 2d points, just store the two doubles as direct members: 28 bytes -> 16 bytes, and the per-Point heap allocation disappears. If you do not really need the precision of double, use float: 16 bytes -> 8 bytes, less than a third of the original size per Point:
// struct Point { std::vector<double> coords; }; <-- old
struct Point { float x, y; }; // <-- new
If this is still not enough, an ad-hoc solution could be std::deque, or another non-contiguous container: no temporary memory "doubling", because growing does not copy the elements, and no need for the OS to find you one huge contiguous block of memory.
You can also use compression mechanisms, or indexed data, or fixed point numbers. But it depends on your exact circumstances.
struct Point { signed char x, y; }; // <-- or even this? pick an appropriately small type
struct Point { short x_index, y_index; };
Without seeing your code this is just speculation, but I suspect it is in large part due to your attempt to allocate a massive amount of contiguous memory. std::vector is guaranteed to store its elements in contiguous memory, so if you try to allocate a large amount of space, the operating system has to find a free block of address space that large. This may not be a problem for 2 MB, but if you are suddenly trying to allocate 200 MB or 2 GB of contiguous memory ...
Additionally, any time you add an element to the vector and it is forced to resize, all of the existing elements must be copied into the newly allocated space. If you have 9 million elements and adding the 9,000,001st element requires a resize, that is 9 million elements that have to be moved. As the vector gets larger, each such copy takes longer.
Try using std::deque instead. It basically allocates storage in fixed-size blocks (each block is contiguous internally), and each block can be placed wherever it fits.
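A sketch of that suggestion, reusing the compact two-float Point from the previous answer (the element count here is arbitrary):

#include <cstddef>
#include <deque>

struct Point { float x, y; };        // compact layout from the previous answer

int main() {
    // A deque grows in independent blocks: no single huge contiguous
    // allocation, and no copying of existing elements when it grows.
    std::deque<Point> points;
    for (std::size_t i = 0; i < 20000000; ++i)
        points.push_back(Point{static_cast<float>(i), 0.5f * i});
    return points.size() == 20000000 ? 0 : 1;
}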

Casting size_t to allow more elements in a std::vector

I need to store a huge number of elements in a std::vector (more than the 2^32-1 allowed by unsigned int) in 32 bits. As far as I know, that quantity is limited by std::size_t, an unsigned integer type. Could I change this std::size_t by casting it to unsigned long? Would that solve the problem?
If that's not possible, suppose I compile in 64 bits. Would that solve the problem without any modification?
size_t is a type that can hold the size of any allocatable chunk of memory. It follows that you can't allocate more memory than fits in a size_t, and thus can't store more elements in any way.
Compiling in 64 bits will allow it, but realize that the array still needs to fit in memory. 2^32 is about 4 billion, so you are going to go over 4 * sizeof(element) GiB of memory. More than 8 GiB of RAM is still rare, so that does not look reasonable.
I suggest replacing the vector with the one from STXXL. It uses external storage, so your vector is not limited by the amount of RAM. The library claims to handle terabytes of data easily.
(edit) Pedantic note: size_t needs to hold the size of the largest possible single object, not necessarily the size of all available memory. In segmented memory models it only needs to accommodate the offset when each object has to live in a single segment, but with multiple segments more memory may be accessible. It is even possible to use this on x86 with PAE, the "long" memory model. However, I've not seen anybody actually use it.
There are a number of things to say.
First, about the size of std::size_t on 32-bit systems and 64-bit systems, respectively. This is what the standard says about std::size_t (§18.2/6,7):
6 The type size_t is an implementation-defined unsigned integer type that is large enough to contain the size in bytes of any object.

7 [ Note: It is recommended that implementations choose types for ptrdiff_t and size_t whose integer conversion ranks (4.13) are no greater than that of signed long int unless a larger size is necessary to contain all the possible values. — end note ]
From this it follows that std::size_t will be at least 32 bits in size on a 32-bit system, and at least 64 bits on a 64-bit system. It could be larger, but that would obviously not make any sense.
Second, about the idea of type casting: For this to work, even in theory, you would have to cast (or rather: redefine) the type inside the implementation of std::vector itself, wherever it occurs.
Third, when you say you need this super-large vector "in 32 bits", does that mean you want to use it on a 32-bit system? In that case, as the others have pointed out already, what you want is impossible, because a 32-bit system simply doesn't have that much memory.
But, fourth, if what you want is to run your program on a 64-bit machine, and use only a 32-bit data type to refer to the number of elements, but possibly a 64-bit type to refer to the total size in bytes, then std::size_t is not relevant because that is used to refer to the total number of elements, and the index of individual elements, but not the size in bytes.
Finally, if you are on a 64-bit system and want to use something of extreme proportions that works like a std::vector, that is certainly possible. Systems with 32 GB, 64 GB, or even 1 TB of main memory are perhaps not extremely common, but definitely available.
However, to implement such a data type, it is generally not a good idea to simply allocate gigabytes of memory in one contiguous block (which is what a std::vector does), because of reasons like the following:
Unless the total size of the vector is determined once and for all at initialization time, the vector will be resized, and quite likely re-allocated, possibly many times as you add elements. Re-allocating an extremely large vector can be a time-consuming operation. [ I have added this item as an edit to my original answer. ]
The OS will have difficulties providing such a large portion of unfragmented memory, as other processes running in parallel require memory, too. [Edit: As correctly pointed out in the comments, this isn't really an issue on any standard OS in use today.]
On very large servers you also have tens of CPUs and typically NUMA-type memory architectures, where it is clearly preferable to work with relatively smaller chunks of memory, and have multiple threads (possibly each running on a different core) access various chunks of the vector in parallel.
Conclusions
A) If you are on a 32-bit system and want to use a vector that large, using disk-based methods such as the one suggested by #JanHudec is the only thing that is feasible.
B) If you have access to a large 64-bit system with tens or hundreds of GB, you should look into an implementation that divides the entire memory area into chunks. Essentially something that works like a std::vector<std::vector<T>>, where each nested vector represents one chunk. If all chunks are full, you append a new chunk, etc. It is straight-forward to implement an iterator type for this, too. Of course, if you want to optimize this further to take advantage of multi-threading and NUMA features, it will get increasingly complex, but that is unavoidable.
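A minimal sketch of such a chunked container, under the assumptions stated above (a fixed chunk size, growth by whole chunks, and no iterator support, so it is not a drop-in std::vector replacement):

#include <cstddef>
#include <vector>

// Stores elements in fixed-size chunks so that no single allocation is huge
// and growth never copies existing elements.
template <typename T, std::size_t ChunkSize = (std::size_t(1) << 20)>
class chunked_vector {
public:
    void push_back(const T& value) {
        if (chunks_.empty() || chunks_.back().size() == ChunkSize) {
            chunks_.emplace_back();
            chunks_.back().reserve(ChunkSize);   // each chunk is allocated exactly once
        }
        chunks_.back().push_back(value);
    }

    T& operator[](std::size_t i) {
        // All chunks except the last are full, so plain division finds the chunk.
        return chunks_[i / ChunkSize][i % ChunkSize];
    }

    std::size_t size() const {
        return chunks_.empty()
            ? 0
            : (chunks_.size() - 1) * ChunkSize + chunks_.back().size();
    }

private:
    std::vector<std::vector<T>> chunks_;
};

With a chunk size of, say, a million elements, appending only ever allocates one new chunk and never moves existing data, and separate threads can be pointed at separate chunks, which also fits the NUMA considerations above.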
A vector might be the wrong data structure for you. It requires its storage to be a single block of memory, whose size is limited by what size_t can express. You can increase that by compiling for 64-bit systems, but then you can't run on 32-bit systems, which might be a requirement.
If you don't need a vector's particular characteristics (notably O(1) random access and a contiguous memory layout), another structure such as std::list might suit you. Being a linked list rather than a conveniently wrapped array, it has no size limits beyond what the computer can physically handle.

stl "vector<T> too long"

I read in other answers that there's no limit imposed by the C++ compiler on the maximum size of std::vector. I am trying to use a vector for one purpose, and I need it to have 10^19 items.
typedef struct {
    unsigned long price, weight;
} product;

// inside main
unsigned long long n = 930033404565174954;
vector<product> psorted(n);
The program breaks on the last statement. If I try resize(n) instead of constructing with n, the program also breaks, with the message:
vector<T> too long
std::length_error at memory location
I need to sort the data according to price after putting it in the vector. What should I do?
std::vector does have limits on how much stuff it can carry. You can query this with std::vector::max_size, which returns the maximum size you can use.
10^19 items.
Do you have 10^19 * sizeof(product) bytes of memory? I'm guessing that you don't have roughly 160 exabytes of RAM. Plus, you'd have to be compiling in 64-bit mode to even consider allocating that much. The compiler isn't breaking; your program is failing at run time because it tries to allocate far too much.
Others have already told you what the problem is. One possible solution is to use the STXXL library, an implementation of STL-style containers designed for huge datasets that do not fit in memory.
However, 10^19 8-byte items is 80 million TB. I'm not sure anyone has a disk that large...
Also, assuming a generous disk bandwidth of 300MB/s, this would take 8000 years to write!