Space efficient C++ vector allocator for large vectors? - c++

I'm working with some C++ code that implements a graph algorithm that uses a lot of small chunks of memory (a relative of gSpan, but that doesn't matter). The code is implemented in C++ and uses std::vectors to store many small elements (on the order of 64 bytes each). However, I'm using this on much larger data sets than the original authors, and I'm running out of memory.
It appears, however, that I'm running out of memory prematurely. Fragmentation? I suspect it is because std::vectors are trying to increase in size every time they need more memory, and vectors insist on contiguous memory. I have 8GB of ram and 18GB of swap, yet when std::bad_alloc is thrown, I'm only using 6.5GB resident and ~8GB virtual. I've caught the bad_alloc calls and printed out the vector sizes and here's what I see:
size: 536870912
capacity: 536870912
maxsize: 1152921504606846975
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
So, clearly, we've hit the maximum size of the vector and the library is trying to allocate more, and failing.
So my questions are:
Am I correct in assuming that is what the problem is?
What is the solution (besides "buy more RAM"). I'm willing to trade CPU time for fitting in memory.
Should i convert the entire code to use std::list (and somehow implement operator[] for the places the code uses it?).. would that even be more ram efficient? at the very least it would allow the list elements to be non-contiguous...right?
Is there a better allocator out there that I can use to override the standard on for vectors for this use case?
What other solutions am I missing?
Since I don't know how much memory will ultimately used, I'm aware that even if I make changes there still might not be enough memory to do my calculations, but I suspect I can get at least a lot further then I'm getting now, which seems to be giving up very quickly.

I would try using std::deque as a direct drop-in for vector. There's a possibility that since it (often) uses a collection of chunks, extending the deque could be much cheaper than extending a vector (in terms of extra memory needed).

Related

Can you predict where in memory a vector might move when growing?

I'm learning about C++ and have a conceptual question. Let's say I have a vector. I know that my vector is stored in contiguous memory, but let's say my vector keeps growing and runs out of room to keep the memory contiguous. How can I predict where in memory the vector will go? I'm excluding the option of using functions that tell the vector where it should be in memory.
If it "runs out of room to keep the memory contiguous", then it simply won't grow. Attempting to add items past the currently allocated size will (typically) result in its throwing an exception (though technically, it's up to the allocator object to decide what to do--it's responsible for memory allocation, and responding when that's not possible.
Note, however, that this could result from running out of address space (especially on a 32-bit machine) rather than running out of actual memory. A typical virtual memory manager can reallocate physical pages (e.g., 4 KB or 8 KB chunks) and write data to the paging file if necessary to free physical memory if needed--but when/if there's not enough contiguous address space, there's not much that can be done.
The answer depends highly on your allocation strategy, but in general, the answer is no. Most allocators do not provide you with information where the next allocation will occur. If you were writing a custom allocator, then you could potentially make this information accessible, but doing so is not necessarily a good idea unless your use case specifically requires this knowledge.
The realloc function is the only C function which will attempt to grow your memory in place, and it makes no guarantees that it will do so.
Neither new nor malloc provide any information for where the "next" allocation will take place. You could potentially guess, if you knew the exact implementation details for your specific compiler, but this would be very unwise to rely on in a real program. Regarding specifically the std::allocator used for std::vector, it also does not provide details about where future allocations will take place.
Even if you could predict it in a particular situation, it would be extremely fragile - all it takes is one function you call to change to make another call to new or malloc [unless you are using a very specific allocation method - which is different from the "usual" method] to "break" where the next allocation is made.
If you KNOW that you need a certain size, you can use std::vector::resize() to set the size of the vector [or std::vector<int> vec(10000); to create a pre-sized to 10000, for example] - which of course is not guaranteed to work, but it guarantees that you never need "enough space to hold 3x the current content", which is what happens with std::vector when you grow it using push_back [and if you are REALLY unlucky, that means that your vector will use 2*n-1 elements, leaving n-1 unused, because your size is n-1 and you add ONE more element, which doubles the size, so now 2*n, and you only actually require one more element...
The internal workings of STL containers are kept private for good reasons. You should never be accessing any container elements through any mechanism other than the appropriate iterators; and it is not possible to acquire one of those on an element that does not yet exist.
You could however, supply an allocator and use that to deterministically place future allocations.
Can you predict where in memory a vector might move when growing?
As others like EJP, Jerry and Mats have said, you cannot determine the location of a "grown" vector until after it grows. There are some corner cases, like the allocator providing a block of memory that's larger than required so that the vector does not actually move after a grow. But its not something you should depend on.
In general, stacks grow down and heaps grow up. This is an artifact from the old memory days. Your code segment was sandwiched between them, and it ensured your program would overwrite its own code segment and eventually cause an illegal instruction. So you might be able to guess the new vector is going to be higher in memory than the old vector because the vector is probably using heap memory. But its not really useful information.
If you are devising a strategy for locating elements after a grow, then use an index and not an iterator. Iterators are invalidated after inserts and deletes (including the grow).
For example, suppose you are parsing the vector and you are looking for the data that follows -----BEGIN CERTIFICATE-----. Once you know the offset of the data (byte 27 in the vector), then you can always relocate it in constant time with v.begin() + 26. If you only have part of the certificate and later add the tail of the data and the -----END CERTIFICATE----- (and the vector grows), then the data is still located at v.begin() + 26.
No, in practical terms you can't predict where it will go if it has to move due to resizing. However, it isn't so random that you could use it as a random number generator (;

Memory usage in C++ program

I wrote a program that needs to handle a very large data with the following libraries:
vector
boost::unordered_map
boost::unordered_multimap
So, I'm having memory problems (The program uses a LOT) and I was thinking maybe I can replace this libraries (with something that already exists or my own implementations):
So, three questions:
How much memory I'd save if I replace vector with a C array? Is it worth it?
Can someone explain how is the memory used in boost::unordered_map and boost::unordered_multimap in the current implementation? Like what's stored in order to achieve their performance.
Can you recommend me some libraries that outperform boost::unordered_map and boost::unordered_multimap in memory usage (But not something too slow) ?
std::vector is memory efficient. I don't know about the boost maps, but the Boost people usually know what they're doing, I doubt you'll save a lot of memory by creating your own variants.
You can do a few other things to help with memory issues:
Compile in 64 bit. Running out of memory in a 64 bit process is very hard.
You won't run out of memory, but memory might get swapped out. You should instead see if you need to load everything into memory at once, perhaps you can work on chunks of the data at a time.
As a side benefit, working on a chunk of the data at a time allows you to run your code in parallel.
With memory being so cheap nowadays, so that allocating 10GB of RAM is very simple, I guess your bottlenecks will be in the processing you do of the data, not of allocating the data.
These two articles explain the data structures underyling some common implementations of unordered associative containers:
Implementation of C++ unordered associative containers
Implementation of C++ unordered associative containers with duplicate elements
Even though there are some differences between implementations, they are modest --one word per element at most. If you go with minimum-overhead solutions such as sorted vectors, this would gain you 2-3 words per element, not even a 2x improvement if your objects are large. So, you'd probably be better off resorting to an environment with more memory or radically changing your approach by using a database or something.
If you have only one set of data and multiple ways of accessing it you can try to use boost::multi_index here is the documentation.
std::vector is basically a contiguous array, plus a few bytes of overhead. About the only way you'll improve with vector is by using a smaller element type. Can you store a short int instead of a regular int? If so, you can cut the vector memory down by half.
Are you perhaps using these containers to hold pointers to many objects on the heap? If so, you may have a lot of wasted space in the heap that could be saved by writing custom allocators, or by doing away with a pointer to a heap element altogether, and by storing a value type within the container.
Look within your class types. Consider all pointer types, and whether they need to be dynamic storage or not. The typical class often has pointer members hanging off a base object, which means a single object is a graph of memory chunks in itself. The more you can inline your class members, the more efficient your use of the heap will be.
RAM is cheap in 2014. Easy enough to build x86-64 Intel boxes with 64-256GB of RAM and Solid State Disk as fast swap if your current box isn't cutting it for the project. Hopefully this isn't a commercial desktop app we are discussing. :)
I ended up by changing boost::unordered_multimap for std::unordered_map of vector.
boost::unordered_multimap consumes more than two times the memory consumed by std::unordered_map of vector due the extra pointers it keeps (at least one extra pointer per element) and the fact that it stores the key and the value of each element, while unordered_map of vector only stores the key once for a vector that contains all the colliding elements.
In my particular case, I was trying to store about 4k million integers, consuming about 15 GB of memory in the ideal case. Using the multimap I get to consume more than 40 GB while using the map I use about 15 GB (a little bit more due the pointers and other structures but their size if despicable).

Vector of 20,000 small objects vs vector of 20,000 object pointers to 20,000 heap objects

Developing a 32-bit C++/carbon app under OS X Snow Leopard, ran into a problem where an stl vector of approximately 20,000 small objects (72 bytes each) was failing during a reallocation. Seems the vector, which was several megabytes in size, couldn't expand to a contiguous piece of memory, which at the point of failure was only 1.2 MB in size.
GuardMalloc[Appname-33692]: *** mmap(size=2097152) failed (error code=12)
*** error: can't allocate region
GuardMalloc[Appname-35026]: Failed to VM allocate 894752 bytes
GuardMalloc[ Appname-35026]: Explicitly trapping into debugger!!!
#0 0x00a30da8 in GMmalloc_zone_malloc_internal
#1 0x00a31710 in GMmalloc
#2 0x94a54617 in operator new
#3 0x0026f1d3 in __gnu_cxx::new_allocator<DataRecord>::allocate at new_allocator.h:88
#4 0x0026f1f8 in std::_Vector_base<DataRecord, std::allocator<DataRecord> >::_M_allocate at stl_vector.h:117
#5 0x0026f373 in std::vector<DataRecord, std::allocator<DataRecord> >::_M_insert_aux at vector.tcc:275
#6 0x0026f5a6 in std::vector<DataRecord, std::allocator<DataRecord> >::push_back at stl_vector.h:610
I can think of several strategies:
1) Reserve() a really, really big vector as soon as the app launches. However, this assumes the user might not load additional files that contribute to this vector, pushing it beyond the pre-allocated limit and possibly getting back into the same situation.
2) Change the vector of objects/memory allocations into a vector of pointers to objects/memory allocations. Clearly makes the vector itself a more manageable size, but then creates 20,000 small objects (which could eventually become like 50,000 objects, depending on what additional files the user loads). Does this create a gigantic overhead problem?
3) Change from a vector to a list, which may have its own overhead issues.
The vector is being constantly iterated through, and generally only appended to.
Any sage thoughts on these issues?
===============
ADDITIONAL NOTE: this particular vector just holds all imported record, so they can be indexed and sorted by ANOTHER vector that contains a sort order. Once an item is put into this vector, it stays there for the lifetime of the app (also helps support undo operations by making sure the index into the vector always remains the same for that particular object).
I think a std::deque would be more suitable than a std::list or a std::vector in your case. std::list is not efficient in iteration or random indexing, while std::vector is slow to resize (as you have observed). A std::deque does not need large amounts of memory when resizing, at the cost of slightly slower random indexing than a vector.
Don't use continuous storage like a vector. Go for a deque or list and the reallocations will not fail anymore.
If you really need high performance, consider writing your own container (ie ArrayList).
Out of your three options, 1 doesn't seem like a guaranteed solution, while 2 adds complexity and the vector still has to grow.
Option 3 seems somewhat reasonable (or possibly use a deque as mentioned in another answer) because while it's semantically similar to option 2, it provides a more normalized method of allocating a new data object. Of course this assumes that you only append data and don't need random access.
However all that said I find it incredible that your program has fragmented memory so badly that on reasonably modern hardware it can't allocate 1.2MB of memory. Far more likely is that there's some undefined behavior lurking (or possibly a memory leak) in your program causing it to behave in this way, failing to allocate the memory. You could use valgrind to help hunt down what may be going on. Does the same problem happen when you use the builtin new and delete rather than GMalloc?
Is your program being ulimited to only have access to a small amount of memory?
Finally, if valgrind finds nothing and your program really is fragmenting memory horribly, I would consider stepping back and reconsidering your approach. You may want to evaluate an alternate approach that doesn't rely on millions(?) of allocations (I just can't see a small number of allocations fragmenting the heap that much).
if even in the heap, there is not enough contigous space, use deque;
deque allocate not contigous space when it is needed. so it could handle the limit of 1.2 MB
deque is made of some number of blocks of memory not only one contigous space. that s why it could work. but it is not sure(/totaly safe) because you dont control the behaviour of the deque.
see thisabout memory fragmentation (following is copy/paste from the web page):
http://www.design-reuse.com/articles/25090/dynamic-memory-allocation-fragmentation-c.html :
Memory Fragmentation
The best way to understand memory fragmentation is to look at an example. For this example, it is assumed hat there is a 10K heap. First, an area of 3K is requested, thus:
#define K (1024)
char *p1;
p1 = malloc(3*K);
Then, a further 4K is requested:
p2 = malloc(4*K);
3K of memory is now free.
Some time later, the first memory allocation, pointed to by p1, is de-allocated:
free(p1);
This leaves 6K of memory free in two 3K chunks. A further request for a 4K allocation is issued:
p1 = malloc(4*K);
This results in a failure – NULL is returned into p1 – because, even though 6K of memory is available, there is not a 4K contiguous block available. This is memory fragmentation.
this is an issue even for kernels using virtual memory such as osx.

Using realloc in c++

std::realloc is dangerous in c++ if the malloc'd memory contains non-pod types. It seems the only problem is that std::realloc wont call the type destructors if it cannot grow the memory in situ.
A trivial work around would be a try_realloc function. Instead of malloc'ing new memory if it cannot be grown in situ, it would simply return false. In which case new memory could be allocated, the objects copied (or moved) to the new memory, and finally the old memory freed.
This seems supremely useful. std::vector could make great use of this, possibly avoiding all copies/reallocations.
preemptive flame retardant: Technically, that is same Big-O performance, but if vector growth is a bottle neck in your application a x2 speed up is nice even if the Big-O remains unchanged.
BUT, I cannot find any c api that works like a try_realloc.
Am I missing something? Is try_realloc not as useful as I imagine? Is there some hidden bug that makes try_realloc unusable?
Better yet, Is there some less documented API that performs like try_realloc?
NOTE: I'm obviously, in library/platform specific code here. I'm not worried as try_realloc is inherently an optimization.
Update:
Following Steve Jessops comment's on whether vector would be more efficient using realloc I wrote up a proof of concept to test. The realloc-vector simulates a vector's growth pattern but has the option to realloc instead. I ran the program up to a million elements in the vector.
For comparison a vector must allocate 19 times while growing to a million elements.
The results, if the realloc-vector is the only thing using the heap the results are awesome, 3-4 allocation while growing to the size of million bytes.
If the realloc-vector is used alongside a vector that grows at 66% the speed of the realloc-vector The results are less promising, allocating 8-10 times during growth.
Finally, if the realloc-vector is used alongside a vector that grows at the same rate, the realloc-vector allocates 17-18 times. Barely saving one allocation over the standard vector behavior.
I don't doubt that a hacker could game allocation sizes to improve the savings, but I agree with Steve that the tremendous effort to write and maintain such an allocator isn't work the gain.
vector generally grows in large increments. You can't do that repeatedly without relocating, unless you carefully arrange things so that there's a large extent of free addresses just above the internal buffer of the vector (which in effect requires assigning whole pages, because obviously you can't have other allocations later on the same page).
So I think that in order to get a really good optimization here, you need more than a "trivial workaround" that does a cheap reallocation if possible - you have to somehow do some preparation to make it possible, and that preparation costs you address space. If you only do it for certain vectors, ones that indicate they're going to become big, then it's fairly pointless, because they can indicate with reserve() that they're going to become big. You can only do it automatically for all vectors if you have a vast address space, so that you can "waste" a big chunk of it on every vector.
As I understand it, the reason that the Allocator concept has no reallocation function is to keep it simple. If std::allocator had a try_realloc function, then either every Allocator would have to have one (which in most cases couldn't be implemented, and would just have to return false always), or else every standard container would have to be specialized for std::allocator to take advantage of it. Neither option is a great Allocator interface, although I suppose it wouldn't be a huge effort for implementers of almost all Allocator classes just to add a do-nothing try_realloc function.
If vector is slow due to re-allocation, deque might be a good replacement.
You could implement something like the try_realloc you proposed, using mmap with MAP_ANONYMOUS and MAP_FIXED and mremap with MREMAP_FIXED.
Edit: just noticed that the man page for mremap even says:
mremap() uses the Linux page table
scheme. mremap() changes the
mapping between
virtual addresses and memory pages. This can be used to implement
a very efficient
realloc(3).
realloc in C is hardly more than a convenience function; it has very little benefit for performance/reducing copies. The main exception I can think of is code that allocates a big array then reduces the size once the size needed is known - but even this might require moving data on some malloc implementations (ones which segregate blocks strictly by size) so I consider this usage of realloc really bad practice.
As long as you don't constantly reallocate your array every time you add an element, but instead grow the array exponentially (e.g. by 25%, 50%, or 100%) whenever you run out of space, just manually allocating new memory, copying, and freeing the old will yield roughly the same (and identical, in case of memory fragmentation) performance to using realloc. This is surely the approach that C++ STL implementations use, so I think your whole concern is unfounded.
Edit: The one (rare but not unheard-of) case where realloc is actually useful is for giant blocks on systems with virtual memory, where the C library interacts with the kernel to relocate whole pages to new addresses. The reason I say this is rare is because you need to be dealing with very big blocks (at least several hundred kB) before most implementations will even enter the realm of dealing with page-granularity allocation, and probably much larger (several MB maybe) before entering and exiting kernelspace to rearrange virtual memory is cheaper than simply doing the copy. Of course try_realloc would not be useful here, since the whole benefit comes from actually doing the move inexpensively.

Multithreading not taking advantage of multiple cores?

My computer is a dual core core2Duo. I have implemented multithreading in a slow area of my application but I still notice cpu usage never exceeds 50% and it still lags after many iterations. Is this normal? I was hopeing it would get my cpu up to 100% since im dividing it into 4 threads. Why could it still be capped at 50%?
Thanks
See What am I doing wrong? (multithreading)
for my implementation, except I fixed the issue that that code was having
Looking at your code, you are making a huge number of allocations in your tight loop--in each iteration you dynamically allocate two, two-element vectors and then push those back onto the result vector (thus making copies of both of those vectors); that last push back will occasionally cause a reallocation and a copy of the vector contents.
Heap allocation is relatively slow, even if your implementation uses a fast, fixed-size allocator for small blocks. In the worst case, the general-purpose allocator may even use a global lock; if so, it will obliterate any gains you might get from multithreading, since each thread will spend a lot of time waiting on heap allocation.
Of course, profiling would tell you whether heap allocation is constraining your performance or whether it's something else. I'd make two concrete suggestions to cut back your heap allocations:
Since every instance of the inner vector has two elements, you should consider using a std::array (or std::tr1::array or boost::array); the array "container" doesn't use heap allocation for its elements (they are stored like a C array).
Since you know roughly how many elements you are going to put into the result vector, you can reserve() sufficient space for those elements before inserting them.
From your description we have very little to go on, however, let me see if I can help:
You have implemented a lock-based system but you aren't judiciously using the resources of the second, third, or fourth threads because the entity that they require is constantly locked. (this is a very real and obvious area I'd look into first)
You're not actually using more than a single thread. Somehow, somewhere, those other threads aren't even fired up or initialized. (sounds stupid but I've done this before)
Look into those areas first.