I want to estimate memory consumption in C++ when I allocate objects on the heap. I can start my calculation with sizeof(object) and round it up to the nearest multiple of the heap's allocation granularity (typically 8 bytes). But if the whole allocated block goes to the allocated object, how can the heap manager tell the size of the object when I ask it to delete the pointer?
If the heap manager tracks the size of each object, does that mean I should add ~4 bytes per allocated object to my total memory consumption estimate for the heap manager's internal bookkeeping? Or does it store this information in a much more compact form? What are the extra costs (memory-wise) of heap memory allocation?
I understand that my question is very implementation-specific, but I would appreciate any hints about the heap metadata storage of major implementations such as GCC (or perhaps this is really about glibc).
Heap allocators are not free. There is a cost per block of allocation (both in size and, if a best-fit algorithm is used, in search time), the cost of joining blocks upon free, and any size lost per block when the size requested is smaller than the block size returned. There is also a cost in terms of memory fragmentation. Consider placing a small 1-byte allocation in the middle of your heap: at that point, you can no longer return a contiguous block larger than half the heap; half your heap is fragmented. Good allocators fight all of the above and try to maximize all the benefits.
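For a concrete feel of the per-block cost on your own system, glibc's non-portable malloc_usable_size() extension shows how each request is rounded up to the allocator's internal size classes (the per-block header is extra and not included in what it reports). A minimal sketch, assuming glibc:

#include <malloc.h>   // malloc_usable_size (glibc extension)
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    for (std::size_t request : {1, 8, 13, 24, 100, 1000}) {
        void* p = std::malloc(request);
        // Prints the requested size next to the size actually set aside.
        std::printf("requested %4zu, usable %4zu bytes\n",
                    request, malloc_usable_size(p));
        std::free(p);
    }
}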
Consider the following allocation system (used in many real-world applications over a decade on numerous handheld game devices).
Create a primary heap where each allocation has a prev ptr, next ptr, size, and possibly other info. Round it to 16 bytes per entry. Store this info before or after the actual memory pointer being returned - your choice as there are advantages to each. Yes, you are allocating requested size + 16 bytes here.
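A sketch of what such a per-block header might look like (the layout and field names are illustrative, not taken from any particular allocator; on a 32-bit system this rounds nicely to 16 bytes, on 64-bit it will be larger):

#include <cstddef>
#include <cstdint>

// Hypothetical per-block header stored immediately before (or after) the
// pointer handed back to the caller.
struct BlockHeader {
    BlockHeader*  prev;   // previous block in the free or used list
    BlockHeader*  next;   // next block in the free or used list
    std::size_t   size;   // usable size of the block, excluding this header
    std::uint32_t flags;  // e.g. free/used marker, debug tag, padding
};

// The address returned to the user sits right after the header:
//   user_ptr = reinterpret_cast<char*>(header) + sizeof(BlockHeader);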
Now just keep a pointer to a free list and possibly a used list.
Allocation is done by finding a block on the free list big enough for our use and splitting it into the size requested and the remainder (first fit), or by searching the whole list for as exact a match as possible (best fit). Simple enough.
Freeing is moving the current item back into the free list, joining areas adjacent to each other if possible. You can see how this can go to O(n).
For smaller allocations, get a single allocation (either from newly created heap, or from global memory) that will be your unit allocation zone. Split this area up into "block size" chunk addresses and push these addresses onto a free stack. Allocation is popping an address from this list. Freeing is just pushing the allocation back to the list - both O(1).
Then, inside your malloc/new/etc., check the size: if it is within the unit size, allocate from the unit allocator; otherwise use the O(n) allocator. My studies have shown that you can get 90-95% of allocations to fit within the unit allocator's block size without too much issue.
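Here is a minimal sketch of the unit (fixed-size block) allocator and the size-based dispatch described above; the class and function names and the fallback to malloc are illustrative assumptions, not the original production code:

#include <cstddef>
#include <cstdlib>
#include <vector>

// Fixed-size "unit" allocator: one slab carved into equal chunks, with the
// free chunks kept on a stack of addresses. Allocation and freeing are O(1).
class UnitAllocator {
public:
    UnitAllocator(std::size_t unit_size, std::size_t unit_count)
        : unit_size_(unit_size), slab_(std::malloc(unit_size * unit_count)) {
        char* p = static_cast<char*>(slab_);
        for (std::size_t i = 0; i < unit_count; ++i)
            free_stack_.push_back(p + i * unit_size);
    }
    ~UnitAllocator() { std::free(slab_); }

    void* allocate() {
        if (free_stack_.empty()) return nullptr;  // slab exhausted
        void* p = free_stack_.back();
        free_stack_.pop_back();
        return p;
    }
    void deallocate(void* p) { free_stack_.push_back(p); }

    std::size_t unit_size() const { return unit_size_; }

private:
    std::size_t unit_size_;
    void* slab_;
    std::vector<void*> free_stack_;
};

// Dispatch: small requests go to the unit allocator, everything else falls
// back to the general-purpose allocator (stubbed out here with malloc).
void* my_alloc(UnitAllocator& units, std::size_t size) {
    if (size <= units.unit_size())
        if (void* p = units.allocate()) return p;
    return std::malloc(size);
}

Freeing needs to know which allocator owns a pointer; in practice that is usually decided by checking whether the address falls inside the unit slab.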
Also, you can allocate slabs of memory for memory pools and leave them allocated while using them over and over. A few larger allocations are much cheaper to manage (Unix systems use this a lot...).
Advantages:
All unit allocations have no external fragmentation.
Small allocations run constant time.
Larger allocation can be tailored to the exact size of the data requested.
Disadvantages:
You pay an internal fragmentation cost when the memory you want is smaller than the block you get back.
Many, many allocations and frees outside your small block size will eventually fragment your memory. This can lead to all sorts of unpleasantness. This can be managed with OS help or with care on alloc/free order.
Once you get thousands of large allocations going, there is a CPU price. Slabs / multiple heaps can be used to fix this as well.
There are many, many schemes out there, but this one is simple and has been used in commercial apps for ages, so I wanted to start here.
The idea is to write a memory manager that allocates a bunch of memory at a time to minimize malloc and free calls. I've tried writing this myself two times, but both times I ran into the problem of defragmenting memory.
You could just check whether a block is empty every so often and delete it if it is. But say your blocks are 100 bytes each. First you allocate 20 bytes of memory; this creates a new 100-byte block because no blocks exist yet. Then you allocate 80 bytes, which fills up the first block. Then you allocate another 20 bytes, which creates a second block because the first one is full. Now you free the second allocation (the 80 bytes), and you are left with two blocks of which only the first 20 bytes of each are used. That means 100 bytes are allocated that could be freed by moving the 20 bytes from the second block into the first block and deleting the second block.
These are the problems I ran into:
you can't move the memory around because this means all pointers to that memory will have to be updated, and for that to happen you need to know their addresses, which you don't;
100 bytes is a very small block size; what if I want to store a very low-res 64×64 ARGB image in memory? That would use 16 KB of memory, and moving all of that might be even slower than not writing a memory manager at all.
Is it even worth it writing a custom memory manager after all that?
Is it even worth it writing a custom memory manager after all that?
This is asking for an opinion, but I'll try to give a factual answer.
The memory allocators that come with most operating systems and language support libraries are generally very high quality and are designed to address the types of problems you encountered (fragmentation and performance) as well as others. They are about as good as general purpose memory allocators can be.
You can do (a little) better than the provided memory allocator if your application has a particular allocation pattern that can be exploited. That's rare, but you can generally take advantage of it by making something substantially simpler than a general purpose memory manager.
you can't move the memory around
True. Most modern systems don't even try to move memory around--they try to avoid fragmentation to begin with (typically by clustering similarly sized allocations).
Old systems (ones without virtual memory managers) sometimes used memory managers that had an extra layer of indirection. Instead of returning a pointer to the allocated memory, the allocator would return a "handle", which could be as simple as an index into a table maintained by the memory manager. When the user wanted to actually access the memory, they would "lock" it. The memory manager was free to move around memory that wasn't locked (e.g., to eliminate fragmentation) because the handles provided an extra level of indirection.
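A toy sketch of that handle-plus-lock scheme (all names are hypothetical, and the compaction pass itself is omitted):

#include <cstddef>
#include <cstdlib>
#include <vector>

// Handle-based allocation: callers hold an index into a table rather than a
// raw pointer, so the manager may relocate blocks that are not locked.
class HandleHeap {
public:
    using Handle = std::size_t;

    Handle allocate(std::size_t size) {
        table_.push_back({std::malloc(size), 0});
        return table_.size() - 1;
    }

    void* lock(Handle h)   { ++table_[h].lock_count; return table_[h].ptr; }
    void  unlock(Handle h) { --table_[h].lock_count; }

    // A compaction pass may move any block whose lock_count is zero and then
    // update table_[h].ptr; outstanding handles remain valid.

private:
    struct Entry { void* ptr; int lock_count; };
    std::vector<Entry> table_;
};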
what if i want to store a very low-res (64,64) ARGB image
Most memory managers provide a range of sizes so a large allocation wouldn't be split across n smaller blocks. Most will punt very large allocations to the system allocator, which, on a virtual memory operating system, can generally solve the problem unless the process address space is overly fragmented.
So I was running a memory usage test in my program where I was adding 20 elements each to two separate vectors each frame (~60fps). I expected that at some point I would start seeing a memory leak, but instead memory usage remained constant until a certain critical point. Around 700,000 elements total it spiked and then leveled off again at the new plateau.
I have a feeling this has something to do with the allocation of the vectors being automatically increased at that point, but I'm not sure and can't find anything online. It also wouldn't really explain why so much extra memory was allocated at that point (Private Bytes on the CPU jumped from ~800 to ~900, and System GPU Memory jumped from ~20 to ~140). Here are the Process Explorer graphs for both CPU and GPU:
Note: the dips in both CPU and GPU usage are from me pausing the program after seeing the spike.
Can anyone explain this to me?
Edit: Here's a simpler, more general test:
The total usage is obviously a lot lower, but same idea.
When you add an element to an empty vector, it will allocate enough space via new for several elements. Like 16 maybe. It does this because resizing the array to a larger buffer is slow, so it allocates more than it needs. If it allocates room for 16 elements, that means you can push back 15 more before it needs to bother with another call to new. Each time it grows significantly more. If you have 500 elements (and it's out of room) and you push back one more, it may allocate room for 750. Or maybe even 1000. Or 2000. Plenty of room.
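You can watch this growth directly; the exact capacities you see are implementation-defined, so treat this as a sketch rather than a guarantee:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    std::size_t last_capacity = 0;
    for (int i = 0; i < 2000; ++i) {
        v.push_back(i);
        if (v.capacity() != last_capacity) {  // a reallocation just happened
            last_capacity = v.capacity();
            std::cout << "size " << v.size()
                      << " -> capacity " << v.capacity() << '\n';
        }
    }
}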
It turns out that when you (or the vector) call new, you get this from the program's memory manager. If the program's memory manager doesn't have enough memory available, it will ask the operating system for a HUGE amount of memory, because operating system calls are themselves slow, and messing with page mappings is slow. So when vector asks for room for 200 bytes, the program's memory manager may actually grab like 65536 bytes, and then give the vector only 200 of that, and save the remaining 65336 bytes for the next call(s) to new. Because of this, you (or vector) can call new many many times before you have to bother the operating system again, and things go quickly.
But this has a side effect: the operating system can't actually tell how much memory your program is really using. All it knows is that you allocated 65536 bytes from it, so that's what it reports. As you push back elements in the vector, eventually the vector runs out of capacity and asks for more from the program's memory manager. As it does that more and more, the operating system keeps reporting the same memory usage, because it can't see those internal allocations. Eventually the memory manager runs out of capacity and asks for more from the operating system. The operating system allocates another huge block (65536? 131072?) and you see a sudden large spike in memory usage.
The vector size at which this happens isn't fixed; it depends on what else has been allocated, and in what order things were allocated and deallocated. Even the things that you deleted still affect things; it's pretty darn complicated. Also, the rate at which vector grows depends on your library implementation, and the amount of memory the program's memory manager grabs from the OS also varies depending on factors even I don't know about.
I have no idea why the GPU's memory would spike, it depends on what you were doing with your program. But note that there is less GPU memory total, it's entirely possible it grew by a smaller amount than "private bytes".
Vectors use a dynamically allocated array to store their elements. This array may need to be reallocated in order to grow when new elements are inserted, which means allocating a new array and moving all elements to it. This is a relatively expensive task in terms of processing time, and thus vectors do not reallocate every time an element is added to the container. Instead, vector containers may allocate some extra storage to accommodate possible growth, so the container may have an actual capacity greater than the storage strictly needed to contain its elements (i.e., its size). This explains your plateaus. The capacity is typically enlarged by doubling the current one; it may be that after a certain number of doublings the implementation quadruples the capacity, which would explain your spike.
Vector Size and Capacity
For good performance, vectors allocate more memory than is currently needed.
size() - returns the current number of elements
empty() - returns whether the container is empty (equivalent to 0 == size(), but possibly faster)
capacity() - returns the maximum number of elements possible without reallocation
reserve(num) - enlarges the capacity, if it is not large enough yet
The capacity of a vector is important because:
Reallocation invalidates all references, pointers, and iterators to elements.
Reallocation takes time, since all elements are moved to a new heap location.
The reallocation size increment depends on the actual vector implementation.
Code example for using reserve(num):
std::vector<int> v1; // Create an empty vector
v1.reserve(80); // Reserve memory for 80 elements
I was doing some experiments on the heap address growth, and something interesting happened.
(OS: CentOS)
But I don't understand why this happened. Thanks!
This is what I did first:
double *ptr[1000];
for (int i = 0; i < 1000; i++) {
    ptr[i] = new double[10000];
    cout << ptr[i] << endl;
}
The output is increasing (for the last few lines):
....
....
0x2481be0
0x2495470
0x24a8d00
0x24bc590
0x24cfe20
0x24e36b0
0x24f6f40
0x250a7d0
0x251e060
Then I changed 10000 to 20000:
double *ptr[1000];
for (int i = 0; i < 1000; i++) {
    ptr[i] = new double[20000];
    cout << ptr[i] << endl;
}
The addresses became more like stack-space addresses (and decreasing):
....
....
0x7f69c4d8a010
0x7f69c4d62010
0x7f69c4d3a010
0x7f69c4d12010
0x7f69c4cea010
0x7f69c4cc2010
0x7f69c4c9a010
0x7f69c4c72010
0x7f69c4c4a010
0x7f69c4c22010
0x7f69c4bfa010
0x7f69c4bd2010
0x7f69c4baa010
0x7f69c4b82010
Different environments/implementations allocate memory using different strategies, so there is no one correct rule. However, a common pattern is to use different allocation strategies for small objects vs. large objects.
Often, a runtime will have multiple heaps for objects of different sizes, which are optimized for different usage patterns. For example, small objects tend to be allocated often and deleted quickly, while large objects tend to be created rarely and have a long life.
If you use a single heap for everything, then small objects will quickly be peppered throughout your memory space, leaving lots of medium-sized blocks available but few or none of the large blocks needed for larger objects. This is referred to as memory fragmentation, and it can cause an allocation to fail even if your app nominally has tons of memory available.
Another reason to use different heaps is to use a different usage tracking method for different object sizes. For example, an implementation might request a new memory block from the OS for large objects, and for small objects, use a few smaller OS memory blocks with sub-allocations handled by the C runtime heap manager. Memory usage tracking mechanisms that are very effective for large objects can be very expensive for smaller ones because the memory used for tracking usage becomes a significant fraction of the actual memory used by each object.
In your case, my guess is that the runtime is allocating small objects at the beginning of the memory space, bottom-up, and larger ones near the end, top-down, to avoid fragmentation.
You're not going to get a great answer here, because operator new can choose any method it wants to allocate memory. My guess would be that the allocator here splits things into small and large allocation pools, and the big allocation pool grows downward so the two can meet in the middle (so as not to waste any space).
On UNIX, allocators use sbrk(2) and mmap(2) to get memory from the OS. The addresses returned by sbrk are well defined, but the addresses from mmap are "whatever is available". On Windows, allocators use VirtualAlloc() which is kind of like mmap.
Implementations are free to have hybrids of different allocation schemes. In C++, it's normal for there to be thousands - even millions - of relatively small objects, so it can make sense for the library's memory allocation routines to make sure they pack well and are very lightweight. Your allocations for 10000 doubles do that: they're 80016 bytes apart - 80000 for the 10000 8-byte variables and just 16 bytes of padding. Note specifically that the size has no relationship to powers of two, whereas when allocating 20000 doubles the addresses decrease by 163840 bytes each time... weirdly, exactly 10 * 2^14. That suggests to me that the former allocations are being satisfied from a heap designed to support efficient small-object allocation by C++'s operator new, while the latter have crossed a threshold and are probably being sent to malloc for memory coming from a distinct heap, with much more waste.
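You can measure that spacing yourself by subtracting consecutive addresses; this is a small sketch of the same experiment (subtracting pointers from unrelated allocations is technically undefined behavior, but it is fine for poking at the allocator):

#include <iostream>

int main() {
    const int count = 10;
    double* ptr[count];
    for (int i = 0; i < count; ++i) {
        ptr[i] = new double[10000];
        if (i > 0) {
            // Byte distance between consecutive allocations; with 10000
            // doubles this came out as 80016 bytes in the question above.
            std::cout << reinterpret_cast<char*>(ptr[i]) -
                         reinterpret_cast<char*>(ptr[i - 1])
                      << " bytes apart\n";
        }
    }
    for (int i = 0; i < count; ++i) delete[] ptr[i];
}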
You were lucky in the sense that the sizes of 10000 doubles and 20000 doubles happen to lie on opposite sides of a critical threshold called MMAP_THRESHOLD.
MMAP_THRESHOLD is 128 KB by default. So 80 KB (i.e., 10000 doubles) allocation requests are serviced from the heap, whereas 160 KB (20000 doubles) allocation requests are serviced by anonymous memory mapping (through the mmap system call). (Note that using memory mapping for large allocations may incur additional penalties because of its different underlying allocation mechanism. You may want to tune MMAP_THRESHOLD for optimal performance of your apps.)
From the Linux man page for malloc:
Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Allocations performed using mmap(2) are unaffected by the RLIMIT_DATA resource limit (see getrlimit(2)).
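If you do want to experiment with that threshold, glibc exposes it through mallopt(3); a minimal, glibc-specific sketch (whether this helps or hurts depends entirely on your allocation pattern):

#include <malloc.h>   // mallopt, M_MMAP_THRESHOLD (glibc)

int main() {
    // Raise the mmap threshold to 256 KiB so that ~160 KB requests such as
    // new double[20000] are served from the normal heap instead of a
    // private anonymous mapping.
    mallopt(M_MMAP_THRESHOLD, 256 * 1024);

    double* p = new double[20000];
    // ... use p ...
    delete[] p;
}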
C++ has two main memory types: heap and stack. With stack everything is clear to me, but concerning heap a question remains:
How is heap memory organised? I read about the heap data structure, but that does not seem applicable to this heap, because in C++ I can access any data from the heap, not just a minimum or maximum value.
So my question is: how is heap memory organized in C++? When we read "is allocated on the heap" what memory organization do we have in mind?
A common (though certainly not the only) implementation of the free store is a linked list of free blocks of memory (i.e., blocks not currently in use). The heap manager will typically allocate large blocks of memory (e.g., 1 megabyte at a time) from the operating system, then split these blocks up into pieces for use by your code when you use new and such. When you use delete, the block of memory you quit using will be added as a node on that linked list.
There are various strategies for how you use those free blocks. Three common ones are best fit, worst fit and first fit.
In best fit, you try to find a free block that's closest in size to the required allocation. If it's larger than required (typically after rounding the allocation size) you split it into two pieces: one to return for the allocation, and a new (smaller) free block to put back on the list to meet other allocation requests. Although this can seem like a good strategy, it's often problematic. The problem is that when you find the closest fit, the left-over block is often too small to be of much use. After a short time, you end up with a huge number of tiny blocks of free space, none of which is good for much of anything.
Worst fit combats that by instead finding the worst fit among the free blocks -- IOW, the largest block of free space. When it splits that block, what's left over will be as large as possible, maximizing the chance that it'll be useful for other allocations.
First fit just walks through the list of free blocks and (as you'd guess from the name) uses the first block that's large enough to meet the requirement.
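A minimal sketch of a first-fit walk over such a free list (the node type and field names are illustrative):

#include <cstddef>

// Hypothetical free-list node: each free block records its size and a link
// to the next free block.
struct FreeBlock {
    std::size_t size;
    FreeBlock*  next;
};

// First fit: return the first free block big enough for the request
// (splitting and unlinking from the list are omitted for brevity).
FreeBlock* first_fit(FreeBlock* free_list, std::size_t wanted) {
    for (FreeBlock* b = free_list; b != nullptr; b = b->next)
        if (b->size >= wanted)
            return b;
    return nullptr;  // nothing large enough; grow the heap
}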
Quite a few also start with a search for an exact fit, and use that in preference to splitting a block.
Quite a few also keep (for example) a number of separate linked lists for different allocation sizes to minimize searching for a block of the right size.
In quite a few cases, the manager also has some code to walk through the list of free blocks, to find any that are right next to each other. If it finds them, it will join the two small blocks into one larger block. Sometimes this is done right when you free/delete a block. More often, it's done lazily, to avoid joining then re-splitting blocks when/if you're using a lot of blocks of the same size (which is fairly common).
Another possibility that's common when dealing with a large number of identically-sized items (especially small ones) is an array of blocks, with a bitset to specify which are free or in use. In this case, you typically keep track of an index into the bitset where the last free block was found. When a block is needed, just search forward from the last index until you find the next one for which the bitset says the block is free.
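A small sketch of such a bitmap-tracked pool of identically sized blocks (sizes and names chosen purely for illustration):

#include <bitset>
#include <cstddef>

// Pool of BlockCount fixed-size blocks, with a bitset marking which are free.
template <std::size_t BlockSize, std::size_t BlockCount>
class BitmapPool {
public:
    BitmapPool() { free_.set(); }  // all blocks start out free

    void* allocate() {
        // Scan forward from where the last free block was found.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            std::size_t idx = (last_ + i) % BlockCount;
            if (free_[idx]) {
                free_[idx] = false;
                last_ = idx;
                return storage_ + idx * BlockSize;
            }
        }
        return nullptr;  // pool exhausted
    }

    void deallocate(void* p) {
        std::size_t idx = (static_cast<char*>(p) - storage_) / BlockSize;
        free_[idx] = true;
    }

private:
    alignas(std::max_align_t) char storage_[BlockSize * BlockCount];
    std::bitset<BlockCount> free_;
    std::size_t last_ = 0;
};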
"Heap" has two main meanings, referring to two different concepts:
A data structure for arranging data: a tree.
A pool of available memory, which is managed by allocating and freeing available memory blocks.
An introduction about Heap in Memory Management:
The heap is the other dynamic memory area, allocated/freed by
malloc/free and their variants....
Here, heap doesn't mean the heap data structure. It is the name given to the memory region from which dynamic allocations are served: when we allocate memory dynamically, it is allocated in heap memory.
When you read "allocated on the heap", that's usually an implementation-specific implementation of C++'s "dynamic storage duration". It means you've used new to allocate memory, and have a pointer to it that you now have to keep track of til you delete (or delete[]) it.
As for how it's organized, there's no one set way of doing so. At one point, if memory serves, the "heap" used to actually be a min-heap (organized by block size). That's implementation specific, though, and doesn't have to be that way. Any data structure will do, some better than others for given circumstances.
I have operator new() replaced in my C++ program so that it allocates a slightly bigger block to store extra data. So the program performs exactly the same set of allocations except that now it requests several bytes more memory in each allocation. Otherwise its behavior is completely the same and it processes exactly same data. The program allocates lots of blocks (millions, I suppose) of various sizes during its runtime.
How will increasing each allocation size by a fixed number of bytes (same for every allocation) affect heap fragmentation?
Unless your program uses some "edge" block sizes (say, near a power of two), I don't see how the block size (or a small difference in block size compared to the program with the standard allocation) would affect fragmentation. With millions of allocations, a good allocator fills up the space and manages it efficiently.
Thinking about it the other way around: imagine your program had originally used blocks of the same sizes as the one with the modified allocator. Would you worry about memory fragmentation in that case?
Heaps are normally implemented as linked lists of cells. On application startup there is only one large cell. Your first allocation breaks off a small piece at the beginning to create a new allocated heap cell. Subsequent allocations do the same. After a while some cells are freed, leaving free holes between allocated blocks.
After running a while, when you request an allocation, the allocator walks the heap until it finds a free cell of equal size to that requested or bigger. Rounding up to larger cell allocation sizes may require more memory up front, but it increases the likelihood of finding a suitable free cell, meaning that new memory does not have to be added to the end of the heap. This may improve performance.
However, bear in mind that heap operations are expensive and therefore should be minimized. You are most probably allocating and deallocating objects of the same type and therefore the same size. Look into using specialized free lists for your objects. This will save the heap operation and thus minimize fragmentation. The STL has allocators for this very reason.
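For instance, C++17's polymorphic memory resources give you a ready-made pooled allocator along these lines; a minimal sketch (the Particle type is just an example):

#include <memory_resource>
#include <vector>

struct Particle { float x, y, z, dx, dy, dz; };

int main() {
    // Pools same-sized allocations together, cutting down on general heap
    // traffic and fragmentation for small, frequently allocated objects.
    std::pmr::unsynchronized_pool_resource pool;
    std::pmr::vector<Particle> particles{&pool};

    for (int i = 0; i < 10000; ++i)
        particles.push_back(Particle{});
}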
It depends on the implementation driving the memory allocator, for instance:
On Windows, it pulls memory from the process heap; under XP, this heap is not set to be the low-fragmentation heap by default, which could really throw a spanner in the works.
Under a bin- or slab-based allocator, your few extra bytes might push an allocation up to the next block size, wasting memory madly and causing horrible virtual memory thrashing.
Depending on your memory usage needs, you might be better served by using a custom allocator to replace ::new, something like Hoard or nedmalloc.
If your blocks (the memory you allocate and deallocate) are still within the size range that the C library allocator handles without fragmentation problems, then you should not face any memory fragmentation. For example, take a look at my own question about allocators: Small block allocator on Linux (or RedHat Linux) to avoid memory fragmentation.
In other words: you have implemented your own ::operator new(), and in it you call malloc() with a slightly bigger block size. malloc() is in the C library, and it is responsible not only for allocating and deallocating but also for avoiding memory fragmentation. If you do not frequently allocate and free blocks with sizes bigger than the allocator can handle efficiently, then you can expect that there will be no memory fragmentation.