What is slower about dynamic memory usage? [duplicate] - c++

This question already has answers here:
Which is faster: Stack allocation or Heap allocation
(24 answers)
Closed 9 years ago.
I know that it's faster to allocate memory on the stack than on the heap, but why is heap allocation slower? Is it because stack allocation is contiguous, so the issue comes down to cache locality? Or is it not the usage of the memory after it has been allocated that is slow, but rather the time taken to allocate it?

Caching issues aside, the CPU stack is just that: a stack, a LIFO list. You remove things from it in exactly the opposite order from the one in which you put them there. You never create holes in it by removing something from its middle. This makes its management extremely simple:
memory[--stackpointer] = value; // push
value = memory[stackpointer++]; // pop
Or you could allocate a large chunk:
stackpointer -= size; // allocate
memset(&memory[stackpointer], 0, size); // use
and free it likewise:
stackpointer += size; // free
Your heap, OTOH, does not have the LIFO property, and for that reason it must keep track of all allocated blocks individually. That means it has to maintain some kind of list of free blocks and a list of allocated blocks, it needs to search for a big enough block when allocating and for the specified block when freeing, and it will likely do some block splitting and coalescing in the process. The simple stack needs to do none of this.
This alone is a significant algorithmic difference between the two ways of allocating and deallocating memory.
Caching and explicit calls to map physical memory into the virtual address space add up as well, but even if you consider those equal in both cases, you are still left with a few instructions versus a few dozen to a few hundred instructions of difference.

"Better" may not be a good way to describe it, but it usually is "faster" to allocate memory on the stack than on the heap. You are correct that it is the allocation of the memory which is slower, not the use of that memory afterwards.
The reason heap allocation tends to be slower is that heap managers need to do additional work: they often try to find an existing block of memory that closely matches the size you are requesting, and when freeing blocks, they typically check adjoining memory areas to see if they can be merged. Stack allocation is simply adding a value to a pointer, nothing more.
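To make that difference concrete, here is a deliberately tiny sketch - not a real allocator, and all names are illustrative - contrasting a one-instruction stack-style allocation with the free-list search a heap has to perform:

#include <cstddef>

// "Stack" allocation: the entire cost is one pointer adjustment.
unsigned char arena[1 << 20];
std::size_t stack_top = sizeof(arena);

void* stack_alloc(std::size_t size) {
    stack_top -= size;                 // alignment ignored for brevity
    return &arena[stack_top];
}

// "Heap" allocation: walk a free list looking for a big-enough block.
struct FreeBlock {
    std::size_t size;
    FreeBlock*  next;
};
FreeBlock* free_list = nullptr;        // assume this is populated elsewhere

void* heap_alloc(std::size_t size) {
    for (FreeBlock** p = &free_list; *p != nullptr; p = &(*p)->next) {
        if ((*p)->size >= size) {      // first fit; real allocators also split
            FreeBlock* block = *p;
            *p = block->next;          // unlink the block from the free list
            return block;
        }
    }
    return nullptr;                    // would ask the OS for more memory here
}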

Related

Does dynamic allocation store data at random locations in the heap?

I know that local variables are stored on the stack in order,
but when I dynamically allocate variables on the heap in C++ like this:
int * a = new int{1};
int * a2 = new int{2};
int * a3 = new int{3};
int * a4 = new int{4};
Question 1: Are these variables stored in contiguous memory locations?
Question 2: If not, is it because dynamic allocation stores variables at random locations in the heap?
Question 3: So does dynamic allocation increase the possibility of cache misses and have low spatial locality?
Part 1: Are separate allocations contiguous?
The answer is probably not. How dynamic allocation occurs is implementation-dependent. If you allocate memory as in the above example, the two separate allocations might happen to be contiguous, but there is no guarantee of this (and it should never be relied upon).
Different implementations of C++ use different algorithms for deciding how memory is allocated.
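One way to see this for yourself is to print the addresses the allocator hands back; a minimal sketch (the output is entirely implementation-dependent):

#include <iostream>

int main() {
    int* a  = new int{1};
    int* a2 = new int{2};
    int* a3 = new int{3};
    int* a4 = new int{4};

    // The addresses may or may not be adjacent; a spacing larger than
    // sizeof(int) (often 16 or 32 bytes) reveals per-allocation overhead
    // and alignment imposed by the allocator.
    std::cout << a << ' ' << a2 << ' ' << a3 << ' ' << a4 << '\n';

    delete a; delete a2; delete a3; delete a4;
}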
Part 2: Is allocation random?
Somewhat; but not entirely. Memory doesn’t get allocated in an intentionally random fashion. Oftentimes memory allocators will try to allocate blocks of memory near each other in order to minimize page faults and cache misses, but it’s not always possible to do so.
Allocation happens in two stages:
1. The allocator asks the OS for a large chunk of memory.
2. The allocator takes pieces of that large chunk and returns them whenever you call new, until you ask for more memory than it has left, in which case it asks the OS for another large chunk.
This second stage is where an implementation can try to give you memory that is near other recent allocations; it has little control over the first stage (the OS usually just provides whatever memory is available, without any knowledge of your program's other allocations).
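A toy sketch of this two-stage scheme, using malloc as a stand-in for the OS-level chunk request (names are illustrative; freeing and the bookkeeping for old chunks are omitted for brevity):

#include <cstddef>
#include <cstdlib>

struct Arena {
    char*       chunk = nullptr;  // current large chunk obtained from the "OS"
    std::size_t used  = 0;
    std::size_t cap   = 0;

    void* allocate(std::size_t n) {
        if (used + n > cap) {     // stage 1: grab another big chunk
            cap   = (n > 64 * 1024) ? n : 64 * 1024;
            chunk = static_cast<char*>(std::malloc(cap)); // stand-in for mmap/sbrk
            used  = 0;            // (a real allocator would keep the old chunk)
        }
        void* p = chunk + used;   // stage 2: carve a piece off the chunk
        used += n;
        return p;
    }
};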
Part 3: avoiding cache misses
If cache misses are a bottleneck in your code,
Try to reduce the amount of indirection (by having arrays store objects by value, rather than by pointer);
Ensure that the memory you’re operating on is as contiguous as the design permits (so use a std::array or std::vector, instead of a linked list, and prefer a few big allocations to lots of small ones); and
Try to design the algorithm so that it has to jump around in memory as little as possible.
A good general principle is to just use a std::vector of objects unless you have a good reason for something fancier. Because of its better cache locality, std::vector is faster at inserting and deleting elements than std::list, even for containers of dozens or hundreds of elements, as the sketch below illustrates.
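As a rough illustration of that claim, a micro-benchmark sketch along these lines (timings are machine- and compiler-dependent, so treat any numbers as indicative only) typically shows the vector winning middle-insertions despite its O(n) element shifts, because the shift is one contiguous copy while the list walk chases a pointer per node:

#include <chrono>
#include <iostream>
#include <iterator>
#include <list>
#include <vector>

// Insert n elements, each time into the middle of the container.
template <typename Container>
long long middle_inserts(int n) {
    Container c;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        auto mid = c.begin();
        std::advance(mid, c.size() / 2);  // the list pays for this walk...
        c.insert(mid, i);                 // ...the vector pays for the shift
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    std::cout << "vector: " << middle_inserts<std::vector<int>>(10000) << " us\n";
    std::cout << "list:   " << middle_inserts<std::list<int>>(10000) << " us\n";
}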
Finally: try to take advantage of the stack. Unless there's a good reason for something to be a pointer, just declare it as a variable that lives on the stack. When possible,
Prefer to use MyClass x{}; instead of MyClass* x = new MyClass{};, and
Prefer std::vector<MyClass> instead of std::vector<MyClass*>.
By extension, if you can use static polymorphism (i.e., templates), use that instead of dynamic polymorphism.
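A hedged sketch of the difference (Shape, ShapeBase, and area() are hypothetical names, not from the question above):

#include <vector>

// Static polymorphism: the element type is a template parameter, so objects
// are stored by value, contiguously, and calls are resolved at compile time.
template <typename Shape>
double total_area(const std::vector<Shape>& shapes) {
    double sum = 0;
    for (const Shape& s : shapes) sum += s.area();
    return sum;
}

// Dynamic polymorphism typically forces pointer storage instead, e.g.
// std::vector<std::unique_ptr<ShapeBase>> - one heap allocation and one
// extra indirection per element.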
IMHO this is operating-system- and C++-standard-library-implementation-specific.
new ultimately uses the lower-level virtual memory allocation services, allocating several pages at once using system calls like mmap and munmap. The implementation of new can also reuse previously freed memory when relevant.
The implementation of new could use different strategies for "large" and "small" allocations.
In the example you gave, the first new results in a system call for memory allocation (usually several pages); the memory allocated could be large enough that subsequent new calls result in contiguous allocations. But this depends on the implementation.
In short:
not at all (there is padding due to alignment, heap housekeeping data, allocated chunks may be reused, etc.),
not at all (AFAIK, heap algorithms are deterministic without any randomness),
generally yes (e.g., memory pooling might help here).

How do C++ heap managers track the size of allocated objects?

I want to estimate memory consumption in C++ when I allocate objects on the heap. I can start my calculation with sizeof(object) and round it up to the nearest multiple of the heap block size (typically 8 bytes). But if the whole allocated block goes to the allocated object, how can the heap manager tell the size of the object when I ask it to delete the pointer?
If the heap manager tracks the size of each object, does that mean I should add ~4 bytes per allocated object to my total for the heap manager's internal expenses? Or does it store this information in a much more compact form? What are the extra (memory-wise) costs of heap allocation?
I understand that my question is very implementation-specific, but I would appreciate any hints about the heap metadata storage of major implementations, such as gcc (or maybe it is about libc).
Heap allocators are not free. There is a cost per block of allocation (both in size and, perhaps, in search time if a best-fit algorithm is used), the cost of joining blocks upon free, and any size lost per block when the size requested is smaller than the block returned. There is also a cost in terms of memory fragmentation: place a small 1-byte allocation in the middle of your heap, and you can no longer return a contiguous block larger than half the heap - half your heap is fragmented. Good allocators fight all of the above and try to maximize all the benefits.
Consider the following allocation system (used in many real-world applications over a decade on numerous handheld game devices):
Create a primary heap where each allocation has a prev ptr, next ptr, size, and possibly other info. Round it to 16 bytes per entry. Store this info before or after the actual memory pointer being returned - your choice as there are advantages to each. Yes, you are allocating requested size + 16 bytes here.
Now just keep a pointer to a free list and possibly a used list.
Allocation is done by finding a block on the free list big enough for our use and splitting it into the size requested and the remainder (first fit), or by searching the whole list for as exact a match as possible (best fit). Simple enough.
Freeing is moving the current item back into the free list, joining areas adjacent to each other if possible. You can see how this can go to O(n).
For smaller allocations, get a single allocation (either from a newly created heap or from global memory) that will be your unit allocation zone. Split this area up into "block size" chunks and push their addresses onto a free stack. Allocation pops an address from this stack; freeing pushes the address back onto it - both O(1).
Then, inside your malloc/new/etc., check the size: if it is within the unit size, allocate from the unit allocator; otherwise, use the O(n) allocator. My studies have shown that you can get 90-95% of allocations to fit within the unit allocator's block size without too much issue.
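A hypothetical header layout for the primary heap described above might look like this (field widths assume a 32-bit target, as on the handheld devices mentioned; purely illustrative):

#include <cstdint>

struct BlockHeader {
    BlockHeader*  prev;   // doubly linked, so coalescing adjacent blocks is cheap
    BlockHeader*  next;
    std::uint32_t size;   // payload size in bytes
    std::uint32_t flags;  // free/used bit, pool tag, guard value, ...
};

// The pointer handed to the caller is the memory just past the header;
// on free, the header is recovered by stepping back over it.
void* payload_of(BlockHeader* h)  { return h + 1; }
BlockHeader* header_of(void* p)   { return static_cast<BlockHeader*>(p) - 1; }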
Also, you can allocate slabs of memory for memory pools and leave them allocated while using them over and over. A few larger allocations are much cheaper to manage (Unix systems use this a lot...).
Advantages:
All unit allocations have no external fragmentation.
Small allocations run constant time.
Larger allocation can be tailored to the exact size of the data requested.
Disadvantages:
You pay an internal fragmentation cost when you want memory smaller than the block requested.
Many, many allocations and frees outside your small block size will eventually fragment your memory. This can lead to all sorts of unpleasantness. This can be managed with OS help or with care on alloc/free order.
Once you get thousands of large allocations going, there is a CPU price. Slabs/multiple heaps can be used to fix this as well.
There are many, many schemes out there, but this one is simple and has been used in commercial apps for ages, so I wanted to start here.

Why is memory not reusable after allocating/deallocating a number of small objects?

While investigating a memory leak in one of our projects, I've run into a strange issue. Somehow, the memory allocated for objects (a vector of shared_ptr to objects, see below) is not fully reclaimed when the parent container goes out of scope, and cannot be reused except for small objects.
The minimal example: when the program starts, I can allocate a single contiguous block of 1.5 GB without a problem. After I use the memory somewhat (by creating and destroying a number of small objects), I can no longer perform the big block allocation.
Test program:
#include <iostream>
#include <memory>
#include <vector>

using namespace std;

class BigClass
{
private:
    double a[10000];
};

void TestMemory() {
    cout << "Performing TestMemory" << endl;
    vector<shared_ptr<BigClass>> list;
    for (int i = 0; i < 10000; i++) {
        shared_ptr<BigClass> p(new BigClass());
        list.push_back(p);
    }
}

void TestBigBlock() {
    cout << "Performing TestBigBlock" << endl;
    char* bigBlock = new char[1024 * 1024 * 1536];
    delete[] bigBlock;
}

int main() {
    TestBigBlock();
    TestMemory();
    TestBigBlock();
}
The problem also repeats if I use plain pointers with new/delete or malloc/free in a loop instead of shared_ptr.
The culprit seems to be that after TestMemory(), the application's virtual memory stays at 827,125,760 bytes (regardless of the number of times I call it). As a consequence, there's no free VM region big enough to hold 1.5 GB. But I'm not sure why, since I'm definitely freeing the memory I used. Is it some "performance optimization" the CRT does to minimize OS calls?
Environment is Windows 7 x64 + VS2012 + 32-bit app without LAA
Sorry for posting yet another answer since I am unable to comment; I believe many of the others are quite close to the answer really :-)
Anyway, the culprit is most likely address space fragmentation. I gather you are using Visual C++ on Windows.
The C / C++ runtime memory allocator (invoked by malloc or new) uses the Windows heap to allocate memory. The Windows heap manager has an optimization in which it will hold on to blocks under a certain size limit, in order to be able to reuse them if the application requests a block of similar size later. For larger blocks (I can't remember the exact value, but I guess it's around a megabyte) it will use VirtualAlloc outright.
Other long-running 32-bit applications with a pattern of many small allocations have this problem too; the one that made me aware of the issue is MATLAB - I was using the 'cell array' feature to basically allocate millions of 300-400 byte blocks, causing exactly this issue of address space fragmentation even after freeing them.
A workaround is to use the Windows heap functions (HeapCreate() etc.) to create a private heap, allocate your memory through that (passing a custom C++ allocator to your container classes as needed), and then destroy that heap when you want the memory back. This also has the happy side effect of being very fast compared to delete-ing a zillion blocks in a loop.
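A minimal sketch of that workaround (Win32 API; error handling omitted):

#include <windows.h>

int main() {
    // Create a growable private heap.
    HANDLE heap = HeapCreate(0 /*options*/, 0 /*initial size*/, 0 /*grow as needed*/);

    // Make the many small allocations from the private heap instead of the
    // CRT heap, e.g. via a custom container allocator that calls HeapAlloc.
    void* p = HeapAlloc(heap, 0, 400);
    (void)p;

    // Destroying the heap releases the whole region at once - no per-block
    // frees, and no fragmentation left behind in the default heap.
    HeapDestroy(heap);
}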
Re: "what is remaining in memory" causing the issue in the first place: nothing is remaining 'in memory' per se; it's more a case of the freed blocks being marked as free but not coalesced. The heap manager has a table/map of the address space, and it won't allow you to allocate anything that would force it to consolidate the free space into one contiguous block (presumably a performance heuristic).
There is absolutely no memory leak in your C++ program. The real culprit is memory fragmentation.
Just to be sure (regarding the memory leak point), I ran this program under Valgrind, and it did not report any memory leaks.
//Valgrind Report
mantosh#mantosh4u:~/practice$ valgrind ./basic
==3227== HEAP SUMMARY:
==3227== in use at exit: 0 bytes in 0 blocks
==3227== total heap usage: 20,017 allocs, 20,017 frees, 4,021,989,744 bytes allocated
==3227==
==3227== All heap blocks were freed -- no leaks are possible
Please find below my responses to the queries/doubts raised in the original question.
The culprit seems to be that after TestMemory(), the application's
virtual memory stays at 827125760 (regardless of number of times I
call it).
Yes, the real culprit is the hidden fragmentation that happens during the TestMemory() function. To explain fragmentation, I have taken this snippet from Wikipedia:
"...when free memory is separated into small blocks and is interspersed by allocated memory. It is a weakness of certain storage allocation algorithms, when they fail to order memory used by programs efficiently. The result is that, although free storage is available, it is effectively unusable because it is divided into pieces that are too small individually to satisfy the demands of the application. For example, consider a situation wherein a program allocates 3 continuous blocks of memory and then frees the middle block. The memory allocator can use this free block of memory for future allocations. However, it cannot use this block if the memory to be allocated is larger in size than this free block."
The above paragraph explains memory fragmentation very nicely. Some allocation patterns (such as frequent allocation and deallocation) lead to memory fragmentation, but the end impact (i.e., the 1.5 GB allocation failing) will vary greatly between systems, as different OSes/heap managers use different strategies and implementations.
As an example, your program ran perfectly fine on my machine(Linux) however you have encountered the memory allocation failure.
Regarding your observation that the VM size remains constant: the VM size seen in the task manager is not directly proportional to our memory allocation calls. It mainly depends on how many bytes are in the committed state. When you allocate some dynamic memory (using new/malloc) and do not write to or initialize those memory regions, they do not go into the committed state, and hence the VM size is not impacted. The VM size depends on many other factors and is a bit complicated, so we should not rely on it completely while reasoning about the dynamic memory allocation of our program.
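The reserve/commit distinction can be seen directly with VirtualAlloc on Windows; a small sketch (the sizes are arbitrary):

#include <windows.h>

int main() {
    // Reserve 256 MB of address space: the VM size grows, but no physical
    // memory or pagefile space is consumed yet.
    void* reserved = VirtualAlloc(nullptr, 256u << 20, MEM_RESERVE, PAGE_NOACCESS);

    // Commit only the first 1 MB; only committed pages count toward the
    // process's commit charge.
    void* committed = VirtualAlloc(reserved, 1u << 20, MEM_COMMIT, PAGE_READWRITE);
    (void)committed;

    VirtualFree(reserved, 0, MEM_RELEASE);
}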
As a consequence, there's no free VM region big enough to hold 1.5 GB.
Yes, due to fragmentation, there is no contiguous 1.5 GB block. Note that the total remaining (free) memory may well be more than 1.5 GB, just in a fragmented state; hence there is no single big contiguous block.
But I'm not sure why - since I'm definitely freeing the memory I used.
Is it some "performance optimization" CRT does to minimize OS calls?
I have explained why this may happen even though you have freed all your memory. Now, in order to fulfil the user program's request, the OS asks its virtual memory manager to allocate memory for use by the heap memory manager. Whether the additional memory can be grabbed, however, depends on many other complex factors which are not very easy to understand.
Possible Resolution of Memory Fragmentation
We should try to reuse memory allocations rather than allocating and freeing frequently. Certain patterns (such as allocations of particular request sizes in a particular order) may leave the overall memory in a fragmented state. There may have to be substantial design changes in your program to reduce memory fragmentation. This is a complex topic and requires an internal understanding of the memory manager to see the complete root cause of such things.
However, there are tools on Windows-based systems that can help here; I am not very familiar with them, but I found one excellent SO post about which tool (on Windows) you can use to check the fragmentation status of your program yourself:
https://stackoverflow.com/a/1684521/2724703
This is not a memory leak. The memory you used was allocated by the C/C++ runtime. The runtime acquires a large chunk of memory from the OS once, and then each new you call is allocated from that chunk. When you delete an object, the runtime does not return the memory to the OS immediately; it may hold on to it for performance.
There is nothing here which indicates a genuine "leak". The pattern of memory you describe is not unexpected. Here are a few points which might help to understand. What happens is highly OS dependent.
A program often has a single heap which can be extended or shrunk in length. It is, however, one contiguous memory area, so changing the size just means moving the end of the heap. This makes it very difficult to ever "return" memory to the OS, since even one tiny object in that space will prevent it from shrinking. On Linux you can look up the function brk (I know you're on Windows, but I presume it does something similar).
Large allocations are often done with a different strategy. Rather than putting them in the general-purpose heap, an extra block of memory is created. When it is deleted, this memory can actually be "returned" to the OS, since it is guaranteed that nothing else is using it.
Large blocks of unused memory don't tend to consume a lot of resources. If you generally aren't using the memory any more they might just get paged to disk. Don't presume that because some API function says you're using memory that you are actually consuming significant resources.
APIs don't always report what you think. Due to a variety of optimizations and strategies it may not actually be possible to determine how much memory is in use and/or available on a system at a particular moment. Unless you have intimate details of the OS you won't know for sure what those values mean.
The first two points can explain why a bunch of small blocks and one large block result in different memory patterns. The latter points indicate why this approach to detecting leaks is not useful. To detect genuine object-based "leaks" you generally need a dedicated profiling tool which tracks allocations.
For example, in the code provided:
1. TestBigBlock allocates and deletes the array; assume this uses a special memory block, so the memory is returned to the OS.
2. TestMemory extends the heap for all the small objects and never returns any of that heap to the OS. Here the heap is entirely available from the application's point of view, but from the OS's point of view it is all assigned to the application.
3. TestBigBlock now fails: although it would use a special memory block, it shares the overall address space with the heap, and there just isn't enough left after step 2 completes.

C++ When to allocate on heap vs stack?

Whilst asking another question (and also before), I was wondering how to judge whether to create an object on the heap or keep it as an object on the stack. What should I ask myself about the object in order to make the correct choice?
Put it on the heap if you have to, the stack if you can.
What kinds of things do you need to put on the heap? Anything of varying length. Any object that might need to be null. Anything that's very large, lest you cause a stack overflow.
Simple answer.
When it goes out of scope, do you want it to hang around and be able to use it?
Depends on intended lifetime of the object.
If you want the object to be alive even after function returns, then HEAP, else STACK
If an object is placed on the HEAP, it must be explicitly free()'d or delete'd by the programmer once its usage is over; otherwise the program will be leaking memory.
Stack memory is fast. It is fast because (a) there is no system overhead to allocate the memory - the allocation is done by simply moving the stack pointer in one instruction - and (b) the memory in the stack is "hot", so it is already in cache. Heap memory is slow because (a) it requires a lot of system work to look around and find a free chunk of memory and (b) it is probably not in cache and will require evicting some data you might have wanted.
Stack memory doesn't get fragmented. It is possible that a heap eventually gets so fragmented that you can't allocate anything (even though, ironically, there is still enough unused memory!).
For long lived data and for large data (multi KB or more), you have to use a heap.
The danger of allocating a bigger stack is that it might hurt you if you are running multiple threads. You have to size the stack for the "worst case" usage, and each thread requires its own stack. On a high-core-count machine (where you might have 200+ threads running), you may not want to arbitrarily increase the stack size. The heap, on the other hand, does not need to be sized for "worst case" usage - it is much more efficient.
Two reasons to use the heap:
1- You need the data to outlive the current scope.
2- You need to reserve a large amount of memory.
Other than that, stay on the stack.
Note: don't reserve a lot of memory on the stack, or you'll get a "Stack-overflow" ;)
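A hedged illustration of that note (default stack sizes are commonly around 1 MB on Windows and 8 MB on Linux, so the exact threshold varies):

#include <vector>

void risky() {
    char buffer[8 * 1024 * 1024];  // ~8 MB automatic array: bigger than many
    buffer[0] = 0;                 // default stacks, so this risks overflow
}

void safer() {
    std::vector<char> buffer(8 * 1024 * 1024);  // heap-backed: same size, safe
    buffer[0] = 0;
}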

How will increasing each memory allocation size by a fixed number of bytes affect heap fragmentation?

I have operator new() replaced in my C++ program so that it allocates a slightly bigger block to store extra data. So the program performs exactly the same set of allocations, except that it now requests several bytes more memory in each allocation. Otherwise its behavior is completely the same, and it processes exactly the same data. The program allocates lots of blocks (millions, I suppose) of various sizes during its runtime.
How will increasing each allocation size by a fixed number of bytes (same for every allocation) affect heap fragmentation?
Unless your program uses some "edge" block sizes (say, close to a power of two), I don't see how the block size (or a small difference in block size compared to the program with the standard allocator) would affect fragmentation. With millions of allocations, a good allocator fills up the space and manages it efficiently.
Thinking about it the other way around: imagine your program originally used blocks of the sizes that the modified allocator now produces. Would you worry about memory fragmentation in that case?
Heaps are normally implemented as linked lists of cells. On application startup there is only one large cell. Your first allocation breaks off a small piece at the beginning to create a new allocated heap cell. Subsequent allocations do the same. After a while, some cells are freed, leaving free holes between allocated blocks.
After running a while, when you request an allocation, the allocator walks the heap until it finds a free cell of a size equal to or bigger than the one requested. Rounding up to larger cell allocation sizes may require more memory up front, but it increases the likelihood of finding a suitable free cell, meaning that new memory does not have to be added at the end of the heap. This may improve performance.
However, bear in mind that heap operations are expensive and should therefore be minimized. You are most probably allocating and deallocating objects of the same type, and therefore the same size. Look into using specialized free lists for your objects: this will save the heap operation and thus minimize fragmentation. The STL has allocators for this very reason.
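A sketch of such a per-class free list (the Particle class is hypothetical; requires C++17 for the inline static member and assumes single-threaded use):

#include <cstddef>
#include <new>

class Particle {
public:
    void* operator new(std::size_t size) {
        if (free_head) {                  // reuse a previously freed slot: O(1)
            void* p = free_head;
            free_head = free_head->next;
            return p;
        }
        return ::operator new(size);      // cold path: fall back to the heap
    }
    void operator delete(void* p) noexcept {
        auto* node = static_cast<Node*>(p);
        node->next = free_head;           // push the slot back: O(1), no heap call
        free_head = node;
    }

private:
    struct Node { Node* next; };
    inline static Node* free_head = nullptr;
    double x = 0, y = 0, z = 0;           // payload (>= sizeof(Node), so slots can be reused)
};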
It depends on the implementation driving the memory allocator, for instance:
On Windows, it pulls memory from the process heap; under XP, this heap is not set to be the low-fragmentation implementation, which could really throw a spanner in the works.
Under a bin- or slab-based allocator, your few extra bytes might push a request up to the next block size, wasting memory madly and causing horrible virtual memory thrashing.
Depending on your memory usage needs, you might be better served by using a custom allocator to replace ::new, something like Hoard or nedmalloc.
If your blocks (allocated and deallocated memory) are still smaller than what a C library allocator handles without fragmentation problems, then you should not face any memory fragmentation. For example, take a look at my own question about allocators: Small block allocator on Linux (or RedHat Linux) to avoid memory fragmentation.
In other words: you have implemented your own ::operator new(), and in it you call malloc() with a slightly bigger block size. malloc() is in the C library, and it is responsible not only for allocating and deallocating but also for avoiding memory fragmentation. If you do not frequently allocate and free blocks of sizes bigger than the allocator can handle efficiently, then you can expect that there will be no memory fragmentation.