Sudden memory spike at 700,000 vector elements - C++

So I was running a memory usage test in my program where I was adding 20 elements each to two separate vectors each frame (~60 fps). I expected that at some point I would start seeing memory usage climb steadily, but instead it remained constant until a certain critical point. Around 700,000 elements total it spiked and then leveled off again at a new plateau.
I have a feeling this has something to do with the allocation of the vectors being automatically increased at that point, but I'm not sure and can't find anything online. It also wouldn't really explain why so much extra memory was allocated at that point (Private Bytes on the CPU jumped from ~800 to ~900, and System GPU Memory jumped from ~20 to ~140). Here are the Process Explorer graphs for both CPU and GPU:
Note: the dips in both CPU and GPU usage are from me pausing the program after seeing the spike.
Can anyone explain this to me?
Edit: Here's a simpler, more general test:
The total usage is obviously a lot lower, but same idea.

When you add an element to an empty vector, it will allocate enough space via new for several elements. Like 16 maybe. It does this because resizing the array to a larger buffer is slow, so it allocates more than it needs. If it allocates room for 16 elements, that means you can push back 15 more before it needs to bother with another call to new. Each time it runs out of room, it grows by significantly more. If you have 500 elements (and it's out of room) and you push back one more, it may allocate room for 750. Or maybe even 1000. Or 2000. Plenty of room.
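A minimal sketch that makes this visible (the exact sequence of capacities depends on your standard library):

#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v;
    std::size_t lastCapacity = v.capacity();

    for (int i = 0; i < 1000; ++i)
    {
        v.push_back(i);
        if (v.capacity() != lastCapacity)
        {
            lastCapacity = v.capacity();
            // Typical libraries grow the buffer by a factor of 1.5x or 2x each time.
            std::cout << "size=" << v.size()
                      << " capacity=" << lastCapacity << "\n";
        }
    }
}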
It turns out that when you (or the vector) call new, you get this from the program's memory manager. If the program's memory manager doesn't have enough memory available, it will ask the operating system for a HUGE amount of memory, because operating system calls are themselves slow, and messing with page mappings is slow. So when vector asks for room for 200 bytes, the program's memory manager may actually grab like 65536 bytes, and then give the vector only 200 of that, and save the remaining 65336 bytes for the next call(s) to new. Because of this, you (or vector) can call new many many times before you have to bother the operating system again, and things go quickly.
But this has a side effect: the operating system can't actually tell how much memory your program is really using. All it knows is that you allocated 65536 bytes from it, so that is what it reports. As you push back elements, the vector eventually runs out of capacity and asks the program's memory manager for more. As that happens again and again, the operating system keeps reporting the same memory usage, because it can't see those internal allocations. Eventually the memory manager itself runs out of capacity and asks the operating system for more. The operating system hands over another huge block (65536? 131072?) and you see a sudden large spike in memory usage.
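If you want to watch this from inside the program, here is a rough Windows-only sketch using the same GetProcessMemoryInfo call that appears further down in this thread (link against Psapi.lib; the exact jumps depend on your CRT's allocator):

#include <windows.h>
#include <psapi.h>
#include <iostream>
#include <vector>

static SIZE_T privateBytes()
{
    PROCESS_MEMORY_COUNTERS_EX pmc;
    GetProcessMemoryInfo(GetCurrentProcess(),
                         (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc));
    return pmc.PrivateUsage;
}

int main()
{
    std::vector<int> v;
    SIZE_T last = privateBytes();

    for (int i = 0; i < 2000000; ++i)
    {
        v.push_back(i);
        if (i % 4096 == 0)                  // sample occasionally, not on every push
        {
            SIZE_T now = privateBytes();
            if (now != last)
            {
                // Private bytes only move when the allocator actually asks the
                // OS for more pages, not on every push_back.
                std::cout << "elements=" << v.size()
                          << " private bytes=" << now << "\n";
                last = now;
            }
        }
    }
}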
The vector size at which this happens isn't fixed; it depends on what else has been allocated, and in what order things were allocated and deallocated. Even things that you deleted still affect it, so it's pretty darn complicated. Also, the rate at which a vector grows depends on your library implementation, and the amount of memory the program's memory manager grabs from the OS also varies depending on factors even I don't know about.
I have no idea why the GPU's memory would spike; that depends on what you were doing with your program. But note that there is less GPU memory in total, so it's entirely possible it actually grew by a smaller absolute amount than "Private Bytes" did.

Vectors use a dynamically allocated array to store their elements. This array may need to be reallocated in order to grow when new elements are inserted, which implies allocating a new array and moving all elements to it. This is a relatively expensive task in terms of processing time, and thus vectors do not reallocate each time an element is added to the container. Instead, vector containers may allocate some extra storage to accommodate possible growth, so the container may have an actual capacity greater than the storage strictly needed to contain its elements (i.e., its size). This explains your plateaus. The capacity enlargement is typically done by multiplying the current capacity by a constant factor (often doubling it); it may be the case that after a certain number of doublings the implementation grows by an even larger factor, which would explain your spike.

Vector Size and Capacity
For good performance, vectors allocate more memory than they currently need.
size() – Returns the current number of elements
empty() – Returns whether the container is empty (equivalent to 0==size() but faster)
capacity() – Returns the maximum possible number of elements without reallocation
reserve(num) – Enlarges the capacity, if it is not enough yet
The capacity of a vector is important, because
Reallocation invalidates all references, pointers, and iterators for elements.
Reallocation takes time, because all elements must be moved to a new heap location.
Reallocation size increment depends on the actual vector implementation.
Code example for using reserve(num):
std::vector<int> v1;  // Create an empty vector    
v1.reserve(80);       // Reserve memory for 80 elements
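A quick illustration of why capacity matters: reserving up front keeps the buffer (and therefore pointers, references, and iterators into it) stable while the vector is filled:

#include <cassert>
#include <vector>

int main()
{
    std::vector<int> v1;
    v1.reserve(80);               // one allocation up front
    v1.push_back(0);
    const int* before = v1.data();

    for (int i = 1; i < 80; ++i)
        v1.push_back(i);          // stays within the reserved capacity

    assert(v1.data() == before);  // no reallocation, so the buffer did not move
    assert(v1.capacity() >= 80);
}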

Related

How do C++ heap managers track the size of allocated objects?

I want to estimate memory consumption in C++ when I allocate objects in the heap. I can start my calculation with sizeof(object) and round it up to the nearest multiple of the heap block (typically 8 bytes). But if the whole allocated block goes to the allocated object, how can the heap manager tell the size of the object when I ask it to delete the pointer?
If the heap manager tracks the size of each object, does that mean I should add ~4 bytes per allocated object to my total memory consumption calculation for the heap manager's internal bookkeeping? Or does it store this information in a much more compact form? What are the extra costs (memory-wise) of heap memory allocation?
I understand that my question is very implementation-specific, but I would appreciate any hints about the heap metadata storage of major implementations such as gcc (or maybe it is about libc).
Heap allocators are not free. There is a cost per block of allocation (both in size and, if a best-fit algorithm is used, in search time), the cost of joining blocks upon free, and any size lost per block when the size requested is smaller than the block size returned. There is also a cost in terms of memory fragmentation. Consider placing a small 1-byte allocation in the middle of your heap. At this point, you can no longer return a contiguous block larger than half the heap - half your heap is fragmented. Good allocators fight all of the above and try to maximize all benefits.
Consider the following allocation system (used in many real world applications over a decade on numerous handheld game devices.)
Create a primary heap where each allocation has a prev ptr, next ptr, size, and possibly other info. Round it to 16 bytes per entry. Store this info before or after the actual memory pointer being returned - your choice as there are advantages to each. Yes, you are allocating requested size + 16 bytes here.
Now just keep a pointer to a free list and possibly a used list.
Allocation is done by finding a block on the free list big enough for our use and splitting it into the size requested and the remainder (first fit), or by searching the whole list for as exact a match as possible (best fit). Simple enough.
Freeing is moving the current item back into the free list, joining areas adjacent to each other if possible. You can see how this can go to O(n).
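A hypothetical sketch of the bookkeeping described above (the names and the first-fit walk are illustrative, not taken from any particular allocator):

#include <cstddef>

// Hypothetical per-block header stored just before the memory handed out.
// On a 32-bit system, four 4-byte fields round to the 16 bytes mentioned above.
struct BlockHeader
{
    BlockHeader* prev;   // previous block on the free (or used) list
    BlockHeader* next;   // next block on the list
    std::size_t  size;   // usable bytes that follow the header
    std::size_t  flags;  // e.g. free/used marker, padding
};

// First fit: walk the free list until a block is big enough.
// The caller splits off the remainder and moves the block to the used list.
BlockHeader* firstFit(BlockHeader* freeList, std::size_t wanted)
{
    for (BlockHeader* b = freeList; b != nullptr; b = b->next)
        if (b->size >= wanted)
            return b;
    return nullptr;      // no block large enough: grow the heap or fail
}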
For smaller allocations, get a single allocation (either from newly created heap, or from global memory) that will be your unit allocation zone. Split this area up into "block size" chunk addresses and push these addresses onto a free stack. Allocation is popping an address from this list. Freeing is just pushing the allocation back to the list - both O(1).
Then inside your malloc/new/etc., check the size: if it is within the unit size, allocate from the unit allocator, otherwise use the O(n) allocator. My studies have shown that you can get 90-95% of allocations to fit within the unit allocator's block size without too much issue.
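And a toy version of the unit allocator idea: fixed-size chunks carved out of one slab and handed out from a free stack in O(1). Everything here is illustrative, not a production design:

#include <cstddef>
#include <cstdlib>
#include <vector>

class UnitAllocator
{
public:
    UnitAllocator(void* slab, std::size_t slabBytes, std::size_t chunkBytes)
    {
        char* p = static_cast<char*>(slab);
        for (std::size_t off = 0; off + chunkBytes <= slabBytes; off += chunkBytes)
            free_.push_back(p + off);      // pre-populate the free stack
    }

    void* allocate()
    {
        if (free_.empty())
            return nullptr;                // caller falls back to the O(n) heap
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    void deallocate(void* p) { free_.push_back(p); }

private:
    std::vector<void*> free_;              // stack of free chunk addresses
};

int main()
{
    void* slab = std::malloc(64 * 1024);   // one big allocation for the unit zone
    UnitAllocator units(slab, 64 * 1024, 32);
    void* a = units.allocate();            // O(1)
    units.deallocate(a);                   // O(1)
    std::free(slab);
}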
Also, you can allocate slabs of memory for memory pools and leave them allocated while using them over and over. A few larger allocations are much cheaper to manage (Unix systems use this a lot...)
Advantages:
All unit allocations have no external fragmentation.
Small allocations run constant time.
Larger allocation can be tailored to the exact size of the data requested.
Disadvantages:
You pay an internal fragmentation cost when the memory you want is smaller than the block size you receive.
Many, many allocations and frees outside your small block size will eventually fragment your memory. This can lead to all sorts of unpleasantness. This can be managed with OS help or with care on alloc/free order.
Once you get 1000s of large allocations going, there is a CPU price. Slabs / multiple heaps can be used to fix this as well.
There are many, many schemes out there, but this one is simple and has been used in commercial apps for ages, so I wanted to start here.

bad_alloc::`scalar deleting destructor'(unsigned int) when I'm trying to create a vector 470 MB in size

The most interesting thing in this case is that yesterday it worked fine.
I have no idea what causes the trouble.
size = 480000000;
std::vector<char> result(size);
Vector tries to allocate a contiguous memory block of the required size. Depending on system memory fragmentation there might be no 0.5 GB block available, so your memory allocation fails.
The most interesting thing in this case is that yesterday it worked fine.
The contents of a std::vector are not stored in thin air; they occupy the memory in your computer. Obviously, the memory situation in your computer changes all the time. Either the total free space available became smaller since yesterday, or there is more fragmentation now, so that even if the total free space is well over 470MB, there is no free contiguous 470MB spot anywhere.
Perhaps a std::deque can fix the problem elegantly enough in the short term:
As opposed to std::vector, the elements of a deque are not stored
contiguously: typical implementations use a sequence of individually
allocated fixed-size arrays, [...]
This comes with a few disadvantages, of course, as the same documentation explains.
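Something like the following avoids the need for a single contiguous 480 MB block; whether it is a drop-in replacement depends on what the rest of the code does with result:

#include <cstddef>
#include <deque>

int main()
{
    const std::size_t size = 480000000;
    std::deque<char> result(size);   // many fixed-size blocks instead of one huge one
    result[0] = 'x';                 // random access still works
}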
You are allocating a lot of contiguous memory, so if you run out of memory you get bad_alloc.

Visual C++ vector erase increases memory usage?

It's my very first question here on Stack Overflow. I have searched at length for an explanation of what I'm experiencing with the following lines of code:
unsigned long long _mem1= getUsedVirtualMemory();
vector.erase(vector.begin() + _idx);
contained= false; // don't stop the loop
_idx--; // the removed object changes the index to be considered.
_mem1 = getUsedVirtualMemory() - _mem1;
if (_mem1 > 0) printf("Memory - 2 mem1: %llu\n", _mem1);
I have a huge memory consumption in my program and after an intensive debug session, some printfs and time consuming analyses, I arrived to this point:
getUsedVirtualMemory is implemented with the following code:
PROCESS_MEMORY_COUNTERS_EX pmc;
GetProcessMemoryInfo(GetCurrentProcess(), (PROCESS_MEMORY_COUNTERS*) &pmc, sizeof(pmc));
SIZE_T virtualMemUsedByMe = pmc.PrivateUsage;
return virtualMemUsedByMe;
to obtain the amount of virtual memory allocated by the process; the vector is a vector of objects (not pointers).
In most cases the vector's erase method seems to work as expected, but in some cases it looks like the erase method of that vector increases the memory used by the process instead of freeing it. I'm using the Windows system function GetProcessMemoryInfo in a lot of situations around the code to debug this problem and it seems to return an actual value for used virtual memory.
I'm using Visual Studio C++ 2010 Professional.
Let me know if more information is needed.
Thanks for any replies.
UPDATE:
Everything you wrote in your replies is correct and I forgot the following details:
I already know that a vector has a size (actual number of elements) and a capacity (allocated slots to store elements)
I already know that the erase method does not free memory (I looked for a lot of documentation about that method)
finally, I will add other elements to that vector later, so I don't need to shrink that vector.
The actual problem is that in that case the value of "_mem1" in the last line of code shows a difference of 1,600,000 bytes: an unjustified increase in memory, when I expected it to be 0 bytes.
Also, if the used memory after the operation were lower than before, I would expect a very big number due to unsigned wraparound, as explained for instance at Is unsigned integer subtraction defined behavior?
Instead of the expected results I get a value greater than 0, but a relatively small one.
To give a sense of the scale of the problem: iterating over that piece of code some thousands of times unexpectedly allocates about 20 GB of virtual memory.
A vector has:
a size(), which indicates how many active elements are in the container
a capacity(), which tells how many elements are reserved in memory
erase() reduces the size (the number of elements); it does not free the allocated capacity.
You can use shrink_to_fit() which makes sure that the capacity is reduced to the size.
Changing the size with resize() or the capacity with reserve() may increase the memory allocated if necessary, but it does not necessarily free memory if the new size/capacity is lower than the existing capacity.
It's because erase will not free memory; it just erases elements. Take a look at Herb Sutter's article.
To (really) release the memory, you could do the following (from the referenced link):
The Right Way To "Shrink-To-Fit" a vector or deque
So, can we write code that does shrink a vector "to fit" so that its capacity is just enough to hold the contained elements? Obviously reserve() can't do the job, but fortunately there is indeed a way:
vector<Customer>( c ).swap( c );
// ...now c.capacity() == c.size(), or
// perhaps a little more than c.size()
vector.erase() is only guaranteed to remove elements from the vector; it is not guaranteed to reduce the size of the underlying array (as that process is rather expensive). I.e., it only destroys the elements, it doesn't necessarily deallocate their storage.
If you need a vector whose storage is only as large as the elements it contains, use vector.shrink_to_fit() or the copy-and-swap idiom shown above; note that vector.resize(vector.size()) is a no-op and will not shrink the capacity.
IIRC in a Debug build on windows, new is actually #defined to be DEBUG_NEW which causes (amongst other things) memory blocks not to be actually freed, but merely marked as 'deleted'.
Do you get the same behaviour with a release build?
One part of the puzzle might be that std::vector cannot delete entries from the middle of the underlying memory buffer (which is where yours are), so the kept entries have to be moved - potentially to an altogether different buffer. Since you're erasing the first element, and the standard states that erase() invalidates all iterators at/after the point of erasure (all of them in your case), std::vector is arguably allowed to allocate an additional buffer, copy the remaining elements to it, and then discard the old buffer. So you may end up with two buffers being in use at the same time, and your memory manager will likely not return the discarded buffer to the operating system, but rather keep the memory around to reuse in a subsequent allocation. This would explain the memory usage increase for a single one of your loop iterations.

Why doesn't std::vector::resize() deallocate the memory when you want to reduce the size of the vector?

I don't understand why I have to use std::vector::swap() to deallocate the memory of a vector when I want to reduce its size.
Why would someone want to reduce the size of a vector and at the same time keep the rest of the unused memory allocated?
Reducing the capacity of a vector involves
allocating new storage
copying over the existing elements
deallocating old storage
None of this is for free. Really shrinking a vector has a cost associated with it, and it isn't clear that users should pay it by default.
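A small sketch of the difference: shrinking the size leaves the capacity alone, and you only pay the reallocation when you explicitly ask for it (shrink_to_fit is C++11 and non-binding; the copy-and-swap trick quoted earlier does the same job on older compilers):

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v(1000000, 42);
    v.resize(10);                     // size shrinks, capacity stays put
    std::cout << v.size() << " / " << v.capacity() << "\n";   // 10 / ~1000000

    v.shrink_to_fit();                // explicit request; may reallocate and copy
    std::cout << v.size() << " / " << v.capacity() << "\n";   // 10 / ~10
}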
To answer your question why someone would want to reduce a vector without deallocating, here is another scenario I'm currently working on:
I need a contiguous buffer whose size is not known a priori, so I use a vector to build it up. I'm going to need this buffer many, many times (>100k) during the execution of the program, and each time it may be slightly different in size. I want to recycle the buffer so I don't incur an expensive reallocation every cycle. The first few cycles are quite expensive, since they reallocate several times during all the push_backs. After the first 100 cycles or so, though, the capacity of the vector has grown to a size big enough to hold what it needs; from then on I'm no longer allocating and deallocating memory every cycle, and I'm only using about as much memory as is needed (which matters when running with many threads and potentially bumping into the total system RAM).
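In code, that recycling pattern looks roughly like this (buildInto is a placeholder for whatever fills the buffer each cycle):

#include <vector>

// Placeholder: in the real program this produces a buffer of varying size.
static void buildInto(std::vector<char>& buf, int cycle)
{
    for (int i = 0; i < 1000 + cycle % 128; ++i)
        buf.push_back(static_cast<char>(i));
}

int main()
{
    std::vector<char> buffer;              // reused across all cycles
    for (int cycle = 0; cycle < 100000; ++cycle)
    {
        buffer.clear();                    // size -> 0, capacity is kept
        buildInto(buffer, cycle);          // push_backs reuse the old storage
        // ... consume buffer here ...
    }
}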

Vector of 20,000 small objects vs vector of 20,000 object pointers to 20,000 heap objects

While developing a 32-bit C++/Carbon app under OS X Snow Leopard, I ran into a problem where an STL vector of approximately 20,000 small objects (72 bytes each) was failing during a reallocation. It seems the vector, which was several megabytes in size, couldn't expand into a contiguous piece of memory, which at the point of failure was only 1.2 MB in size.
GuardMalloc[Appname-33692]: *** mmap(size=2097152) failed (error code=12)
*** error: can't allocate region
GuardMalloc[Appname-35026]: Failed to VM allocate 894752 bytes
GuardMalloc[ Appname-35026]: Explicitly trapping into debugger!!!
#0 0x00a30da8 in GMmalloc_zone_malloc_internal
#1 0x00a31710 in GMmalloc
#2 0x94a54617 in operator new
#3 0x0026f1d3 in __gnu_cxx::new_allocator<DataRecord>::allocate at new_allocator.h:88
#4 0x0026f1f8 in std::_Vector_base<DataRecord, std::allocator<DataRecord> >::_M_allocate at stl_vector.h:117
#5 0x0026f373 in std::vector<DataRecord, std::allocator<DataRecord> >::_M_insert_aux at vector.tcc:275
#6 0x0026f5a6 in std::vector<DataRecord, std::allocator<DataRecord> >::push_back at stl_vector.h:610
I can think of several strategies:
1) Reserve() a really, really big vector as soon as the app launches. However, this assumes the user might not load additional files that contribute to this vector, pushing it beyond the pre-allocated limit and possibly getting back into the same situation.
2) Change the vector of objects/memory allocations into a vector of pointers to objects/memory allocations. Clearly makes the vector itself a more manageable size, but then creates 20,000 small objects (which could eventually become like 50,000 objects, depending on what additional files the user loads). Does this create a gigantic overhead problem?
3) Change from a vector to a list, which may have its own overhead issues.
The vector is being constantly iterated through, and generally only appended to.
Any sage thoughts on these issues?
===============
ADDITIONAL NOTE: this particular vector just holds all imported records, so they can be indexed and sorted by ANOTHER vector that contains a sort order. Once an item is put into this vector, it stays there for the lifetime of the app (this also helps support undo operations by making sure the index into the vector always remains the same for that particular object).
I think a std::deque would be more suitable than a std::list or a std::vector in your case. std::list is not efficient in iteration or random indexing, while std::vector is slow to resize (as you have observed). A std::deque does not need large amounts of memory when resizing, at the cost of slightly slower random indexing than a vector.
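For example, assuming the DataRecord type from the stack trace, the switch is mostly mechanical because deque offers the same push_back and operator[]:

#include <deque>

struct DataRecord { char payload[72]; };   // stand-in for the real 72-byte record

int main()
{
    std::deque<DataRecord> records;        // grows in fixed-size blocks; no huge
    for (int i = 0; i < 20000; ++i)        // contiguous reallocation is ever needed
        records.push_back(DataRecord());

    DataRecord& r = records[12345];        // random access still works
    (void)r;
}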
Don't use contiguous storage like a vector. Go for a deque or list instead and the reallocations will not fail anymore.
If you really need high performance, consider writing your own container (i.e. something like an ArrayList).
Out of your three options, 1 doesn't seem like a guaranteed solution, while 2 adds complexity and the vector still has to grow.
Option 3 seems somewhat reasonable (or possibly use a deque as mentioned in another answer) because while it's semantically similar to option 2, it provides a more normalized method of allocating a new data object. Of course this assumes that you only append data and don't need random access.
However, all that said, I find it incredible that your program has fragmented memory so badly that on reasonably modern hardware it can't allocate 1.2 MB of memory. Far more likely is that there's some undefined behavior (or possibly a memory leak) lurking in your program, causing it to behave in this way and fail to allocate the memory. You could use valgrind to help hunt down what may be going on. Does the same problem happen when you use the built-in new and delete rather than GMalloc?
Is your program being ulimited to only have access to a small amount of memory?
Finally, if valgrind finds nothing and your program really is fragmenting memory horribly, I would consider stepping back and reconsidering your approach. You may want to evaluate an alternate approach that doesn't rely on millions(?) of allocations (I just can't see a small number of allocations fragmenting the heap that much).
If even on the heap there is not enough contiguous space, use a deque.
A deque does not allocate one contiguous region; it allocates blocks as they are needed, so it can cope with the 1.2 MB limit.
A deque is made up of a number of separate blocks of memory rather than one contiguous space, which is why it can work. But it is not a guaranteed (or totally safe) fix, because you don't control the behaviour of the deque.
See this article about memory fragmentation (the following is copied from the web page):
http://www.design-reuse.com/articles/25090/dynamic-memory-allocation-fragmentation-c.html
Memory Fragmentation
The best way to understand memory fragmentation is to look at an example. For this example, it is assumed that there is a 10K heap. First, an area of 3K is requested, thus:
#define K (1024)
char *p1;
p1 = malloc(3*K);
Then, a further 4K is requested:
char *p2;
p2 = malloc(4*K);
3K of memory is now free.
Some time later, the first memory allocation, pointed to by p1, is de-allocated:
free(p1);
This leaves 6K of memory free in two 3K chunks. A further request for a 4K allocation is issued:
p1 = malloc(4*K);
This results in a failure – NULL is returned into p1 – because, even though 6K of memory is available, there is not a 4K contiguous block available. This is memory fragmentation.
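Put together as one compilable snippet (with casts added so it builds as C++). On a real desktop OS with virtual memory the heap is far larger than 10K, so the final malloc will normally succeed there; the NULL check is the point of the exercise on the small 10K heap the article assumes:

#include <cstdio>
#include <cstdlib>

#define K (1024)

int main()
{
    char *p1 = (char *)malloc(3*K);    // 3K in use
    char *p2 = (char *)malloc(4*K);    // 7K in use, 3K free (in the 10K heap)

    free(p1);                          // 6K free, but split into two 3K holes

    p1 = (char *)malloc(4*K);          // fails on the 10K heap: no 4K contiguous hole
    if (p1 == NULL)
        printf("allocation failed: memory is fragmented\n");

    free(p1);                          // free(NULL) is a no-op, so this is safe
    free(p2);
    return 0;
}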
This is an issue even on operating systems that use virtual memory, such as OS X.