I have an application where the sequence of malloc/free operations is known in advance. I'd like to do a pre-computation to minimize the maximum memory usage. Are there any resources on that (c++ implementations/research papers)?
More precisely, the same sequence of malloc/free operations is repeated many times (at the end of each cycle everything is freed). So I can afford some computation to optimize memory usage.
Assuming what you want to achieve is to minimize time spent allocating memory and possibly improve cache locality, this sounds quite simple, actually.
Just choose a memory manager (write one, or use a pre-existing one such as Hoard). Then let the memory manager allocate the maximum amount of memory used during a cycle at the start of the program.
The main issue is calculating this amount of memory. A simple solution would be to go through one cycle using an allocator that does nothing other than wrap malloc/free, together with a counter that keeps track of current memory usage and maximum usage. At the end of your cycle, that maximum is how much you should allocate at the beginning.
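For illustration, a minimal sketch of such a counting wrapper (the names counting_malloc/counting_free are made up, and it assumes every free passes the matching size; alternatively, store the size in a small header in front of each block):

#include <cstdlib>
#include <cstddef>

// Hypothetical counting wrapper: tracks current and peak heap usage.
static std::size_t g_current = 0;
static std::size_t g_peak = 0;

void* counting_malloc(std::size_t size) {
    void* p = std::malloc(size);
    if (p) {
        g_current += size;
        if (g_current > g_peak) g_peak = g_current;
    }
    return p;
}

// Assumes the caller passes the same size that was allocated.
void counting_free(void* p, std::size_t size) {
    std::free(p);
    g_current -= size;
}

// After one full cycle, g_peak is the amount to pre-allocate up front.

Run one cycle through this, and the final value of g_peak is the figure you need.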
The one thing to look out for is that fragmentation in your allocated memory could cause the need for extra allocations. This can usually be avoided by a good memory manager. In the worst case, you may have to track max memory allocated for every allocation size separately.
As a sidenote, if you are using C++, why are you using malloc/free instead of new/delete?
More precisely, the same sequence of malloc/free operations is repeated many times (at the end of each cycle everything is freed). So I can afford some computation to optimize memory usage.
For memory usage, this is not a hard case to solve. The same memory will be reallocated for the same purpose, so it's not "wasteful" of memory if you allocate the same lumps of memory over and over again.
Since you're talking about malloc and free, are we dealing with old-style "C"-type usage of the heap? So there are no constructors or destructors to worry about? Why not then create arrays of elements of a given type, e.g.
struct X
{
...
};
Old code:
X* px[10];
for (int i = 0; i < 10; i++)
{
    px[i] = (X*)malloc(sizeof(X));
    ...
}
instead do:
X* px[10];
X* xx = (X*)malloc(sizeof(X) * 10);
for (int i = 0; i < 10; i++)
{
    px[i] = &xx[i];
}
I know that local variables are stored on the stack in order, but when I dynamically allocate variables on the heap in C++ like this:
int * a = new int{1};
int * a2 = new int{2};
int * a3 = new int{3};
int * a4 = new int{4};
Question 1: are these variables stored in contiguous memory locations?
Question 2: if not, is it because dynamic allocation stores variables at random locations in the heap?
Question 3: does dynamic allocation therefore increase the possibility of cache misses and have low spatial locality?
Part 1: Are separate allocations contiguous?
The answer is probably not. How dynamic allocation occurs is implementation dependent. If you allocate memory like in the above example, two separate allocations might be contiguous, but there is no guarantee of this happening (and it should never be relied on to occur).
Different implementations of C++ use different algorithms for deciding how memory is allocated.
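If you're curious, you can observe this for yourself by printing the addresses; a quick sketch (the output is entirely implementation-dependent and should never be relied upon):

#include <iostream>

int main() {
    int* p[4];
    // Print the addresses of a few separate allocations. Whether they
    // end up adjacent depends entirely on the implementation.
    for (int i = 0; i < 4; ++i) {
        p[i] = new int{i};
        std::cout << static_cast<void*>(p[i]) << '\n';
    }
    for (int i = 0; i < 4; ++i)
        delete p[i];
}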
Part 2: Is allocation random?
Somewhat; but not entirely. Memory doesn’t get allocated in an intentionally random fashion. Oftentimes memory allocators will try to allocate blocks of memory near each other in order to minimize page faults and cache misses, but it’s not always possible to do so.
Allocation happens in two stages:
The allocator asks for a large chunk of memory from the OS
The allocator takes pieces of that large chunk and returns them whenever you call new, until you ask for more memory than it has left to give, in which case it asks for another large chunk from the OS.
This second stage is where an implementation can try to give you memory that's near other recent allocations; however, it has little control over the first stage (and the OS usually just provides whatever memory is available, without any knowledge of other allocations by your program).
Part 3: avoiding cache misses
If cache misses are a bottleneck in your code,
Try to reduce the amount of indirection (by having arrays store objects by value, rather than by pointer);
Ensure that the memory you’re operating on is as contiguous as the design permits (so use a std::array or std::vector, instead of a linked list, and prefer a few big allocations to lots of small ones); and
Try to design the algorithm so that it has to jump around in memory as little as possible.
A good general principle is to just use a std::vector of objects unless you have a good reason to use something fancier. Because of its better cache locality, std::vector is faster at inserting and deleting elements than std::list, even up to dozens or hundreds of elements.
Finally, try to take advantage of the stack. Unless there's a good reason for something to be a pointer, just declare it as a variable that lives on the stack. When possible,
Prefer to use MyClass x{}; instead of MyClass* x = new MyClass{};, and
Prefer std::vector<MyClass> instead of std::vector<MyClass*>.
By extension, if you can use static polymorphism (i.e., templates), use that instead of dynamic polymorphism.
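To make the by-value guideline concrete, a small sketch (MyClass is a placeholder type):

#include <vector>

struct MyClass { int payload; };

int main() {
    // By value: elements live contiguously in the vector's buffer,
    // so iterating walks memory linearly (cache-friendly).
    std::vector<MyClass> by_value(1000);

    // By pointer: every element is a separate heap allocation scattered
    // across memory; each access costs an extra indirection.
    std::vector<MyClass*> by_pointer;
    for (int i = 0; i < 1000; ++i)
        by_pointer.push_back(new MyClass{i});

    for (MyClass* p : by_pointer)
        delete p;
}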
IMHO this is specific to the operating system and the C++ standard library implementation.
new ultimately uses lower-level virtual memory allocation services, allocating several pages at once using system calls like mmap and munmap. The implementation of new may reuse previously freed memory space when relevant.
The implementation of new could use various and different strategies for "large" and "small" allocations.
In the example you gave, the first new results in a system call for memory allocation (usually several pages); the allocated memory could be large enough that subsequent new calls result in contiguous allocations. But this depends on the implementation.
In short:
not at all (there is padding due to alignment, heap housekeeping data, allocated chunks may be reused, etc.),
not at all (AFAIK, heap algorithms are deterministic without any randomness),
generally yes (e.g., memory pooling might help here).
I'm working on a Windows environment with Visual Studio 2013.
In my application, I need to store the addresses of the new-ly created objects.
Something as simple as this allocates ~7.6MB of memory and deallocates it as expected; the program starts with memory usage of ~0.4MB:
double* dptr = new double[1000000]; // allocates 8*1000000 ~7.6MB
delete[] dptr;
after the last line, memory usage goes back to 0.4MB.
But the problem occurs when I try to do something like this:
// addresses buffer for objects of type 'double'
double** dpptr = new double*[1000000];
// instantiate objects and store their addresses in the buffer
for (int i = 0; i < 1000000; i++)
    dpptr[i] = new double;
/**** Problem 1 ****/
// --- time passes ---
// delete each object
for (int i = 0; i < 1000000; i++)
    delete dpptr[i];
// delete the addresses buffer
delete[] dpptr;
/**** Problem 2 ****/
On my machine sizeof(double*) = 4 and sizeof(double) = 8, so if I'm doing the maths correctly:
Size of the addresses buffer = 4 * 1000000 ~3.8MB
Size of all double objects = 8 * 1000000 ~7.6MB
Total memory usage should be 11.4MB
But when I run this program I face 2 problems:
Before deleting each object (after instantiating them), the memory usage goes up to ~72.9MB rather than 11.4MB
After the last line, memory usage goes down to ~6.1MB rather than 0.4MB as stated above.
Why is this causing much higher memory usage than expected?
Also, what's the reason for the apparent memory leak at the end of the program?
Dynamic memory allocations have overhead, and the overhead is proportionally larger for small allocations; 8 bytes is quite a small individual allocation. If implemented with malloc, dynamic memory allocations must be aligned to the largest native alignment, which is 16 bytes on my system; that alone would explain a minimum of 100% overhead for an 8-byte allocation. There will be some bookkeeping overhead as well to keep track of all allocations, and the amount of bookkeeping information is proportional to the number of allocations, which is much greater in your latter example than in the former. Just the 16-byte alignment puts the million doubles at 16MB or more before any bookkeeping, on top of the ~3.8MB pointer array. I'm not sure this explains all of your memory use, but 11.4MB is certainly an unrealistic expectation.
There probably was no memory leak. More likely, your expectation that the process would release the memory back to the operating system was simply misguided. The memory allocator provided by the runtime library is typically implemented so that it often doesn't release memory to the OS. This is especially likely for small deallocations, and for deallocations that weren't done in LIFO order, although whether either has an effect will depend on the implementation.
The allocation system keeps the freed memory for the process and reuses it for future allocations. To see if there really was a leak, repeat the allocations and see whether the memory use doubles.
I'm having memory problems implementing a vector< list<size_t> >. Here's the code:
struct structure {
    vector< list<size_t> > Ar;

    structure(int n) {
        Ar.reserve(n);
        for (size_t i = 0; (int)i < n; i++) {
            list<size_t> L;
            Ar.push_back(L);
        }
    }

    ~structure() {
        vector< list<size_t> >::iterator it = Ar.begin();
        while (it < Ar.end()) {
            (*it).clear();
            ++it;
        }
        Ar.clear();
    }
};

int main() {
    for (size_t k = 0; k < 100; k++) {
        structure test_s = structure(1000 * (100 - k));
    }
    return 0;
}
The physical memory allocated to this program should go down as time progresses, because less and less memory is being allocated to test_s, through the use of 100 - k in the constructor. This isn't what I'm observing! Rather, the physical memory increases around halfway through the run!
Since I am using this code in a bigger program that eats up a huge amount of memory, this is a bit of a catastrophe!
There are two details that I find strange:
Firstly, there is no progressive increase in physical memory usage, even though the size of the object changes at every stage of the for loop; rather, the memory increases suddenly at around the 50th iteration. This happens consistently on every run (and I've done many!). The iteration at which the memory increases is not random!
Secondly, when I pass a static size_t (e.g. 10000) to the structure(size_t) constructor, I don't get the problem anymore. As you can probably guess, a static value isn't very useful for my code, as I need to be able to dynamically allocate the size of the structure object.
I am compiling with g++ on OS X 10.8.3. I haven't tried compiling on another platform, as I would prefer to keep working on my Mac.
None of the memory management tools I have tried (Apple Instruments and Valgrind) have been particularly helpful. Valgrind only returns references to libraries and not to the program itself.
Any help would be much appreciated!!
Cheers,
Plamen
The C++ allocator doesn't necessarily return memory to the OS when it's done with it, but usually keeps it around since you're probably going to need it soon.
I'm not familiar with the details of the OS X allocator, but it's quite common for an allocator to grab memory from the OS in larger chunks and then treat them as separate pools.
This may be what you're seeing, with sudden growth as the first chunk of memory is filled.
It's also possible that you're passing some threshold between "larger" allocations and "smaller" allocations, and you're just seeing an added pool for slightly smaller things - some allocators do that.
It's also possible that the cause is something entirely different, of course.
The difference when you're using the same size for each one is most likely because it's easy for the allocator to fill the request using a block of the same size that was recently freed.
When the blocks have different sizes it's faster to allocate a new block with a different size than to divide a free block into two smaller ones.
(This can also unfortunately lead to memory fragmentation. If you get many scattered small blocks, it may happen that a large allocation request can't be fulfilled despite there being enough room in total.)
In summary: Memory allocators and operating systems are quite complicated these days, and you can't look at growth in memory allocation and say for certain that you have a memory leak.
I would trust valgrind and Instruments in your case.
I don't see any leak in that code, but a lot of unneeded code. A simplified version would be:
struct structure {
    vector< list<size_t> > Ar;

    structure(int n) : Ar(n) // initialize the vector with n empty lists
    {
    }

    // destructor unneeded, since Ar will be destroyed anyway, with all of its elements
};
But this doesn't answer your question.
Heap memory allocation doesn't mean physical memory allocation. Modern OSes use virtual memory, usually backed by a paging file. Heap allocation gets memory from virtual memory, and the OS decides if more or less physical memory is needed. Freeing memory back to virtual memory doesn't mean freeing that physical memory (if it is not needed for another process, why do that right away?).
Also, heap memory allocations are not directly translated into virtual memory allocations. Usually, virtual memory allocation has a large granularity, so it is not suitable for small allocations. The heap manager therefore allocates blocks of virtual memory and manages them for all heap allocations (if virtual memory runs short, the heap manager asks for more). When unused blocks of virtual memory are freed depends on how the heap manager is implemented.
To make things a bit more complex, allocating and deallocating memory of different sizes produces heap fragmentation, depending on the allocation/deallocation pattern and on how the heap is implemented.
Physical memory is not a good indicator of a memory leak in this type of program. Private (virtual) memory or a similar metric would be better.
Are there large gains in PERFORMANCE to be made by allocating heap memory in advance and filling it incrementally?
Consider the VERY simplified example below:
typedef unsigned char byte; // assuming 'byte' is a typedef along these lines

byte* heapSpace = (byte*)malloc(1000000);
int currentWriteSpot = 0;

struct A {
    int x;
    byte* extraSpace;
    int extraSpaceLength;
};

// a1 needs 10 bytes of extra storage space:
A a1;
a1.x = 2;
a1.extraSpace = heapSpace + currentWriteSpot;
a1.extraSpaceLength = 10;
currentWriteSpot += 10;

// a2 needs 120 bytes of extra storage space:
A a2;
a2.x = 24;
a2.extraSpace = heapSpace + currentWriteSpot;
a2.extraSpaceLength = 120;
currentWriteSpot += 120;

// ... many more elements added

for ( ... ) {
    // loop contiguously over the allocated elements, manipulating contents stored at "extraSpace"
}

free(heapSpace);
VS:
...
a1.extraSpace = (byte*)malloc(10);
a2.extraSpace = (byte*)malloc(120);
a3...
a4...
...

// do stuff

free(a1.extraSpace);
free(a2.extraSpace);
free ...
free ...
free ...
Or is this likely to simply add complexity without significant gains in performance?
Thanks folks!
First of all, doing this does not increase complexity; it decreases it. Because you have already determined at the beginning of your operation that malloc was successful, you don't need any further checks for failure, which would at least have to free the allocations already made and perhaps reverse other changes to various objects' states.
One of the other benefits, as you've noted, is performance. This will be a much bigger issue in multi-threaded programs where calls to malloc could result in lock contention.
Perhaps a more important benefit is avoiding fragmentation. If the entire data object is allocated together rather than in small pieces, freeing it will definitely return usable contiguous space of the entire size to the free memory pool to be used by later allocations. On the other hand, if you allocate each small piece separately, there's a good possibility that they won't be contiguous.
In addition to reducing fragmentation, allocating all the data as a single contiguous block also avoids per-allocation overhead (at least 8-16 bytes per allocation are wasted) and improves data locality for cache purposes.
By the way, if you're finding this sort of allocation strategy overly complex, you might try making a few functions to handle it for you, or using an existing library like GNU obstack.
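If you do roll your own, the pattern in the question is essentially a bump (or arena) allocator. A minimal sketch under that assumption, with the alignment handling the original example glosses over (names are illustrative, not a definitive implementation):

#include <cstdlib>
#include <cstddef>

// Bump allocator sketch: one upfront malloc, a pointer bump per request,
// and a single free that releases everything at once.
struct Arena {
    unsigned char* base;
    std::size_t capacity;
    std::size_t used;
};

bool arena_init(Arena* a, std::size_t capacity) {
    a->base = (unsigned char*)std::malloc(capacity);
    a->capacity = capacity;
    a->used = 0;
    return a->base != nullptr;
}

void* arena_alloc(Arena* a, std::size_t size) {
    // Round up to the largest fundamental alignment so any type fits.
    std::size_t aligned = (a->used + alignof(std::max_align_t) - 1)
                          & ~(alignof(std::max_align_t) - 1);
    if (aligned + size > a->capacity) return nullptr; // arena exhausted
    a->used = aligned + size;
    return a->base + aligned;
}

void arena_destroy(Arena* a) {
    std::free(a->base); // one free releases every allocation at once
    a->base = nullptr;
}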
The reason you would want to do this is if you need to guarantee consistent allocation times (where 'consistent' != 'fastest'). The biggest example is the draw loop of a game or other paint operation - it's far more important for it not to "hiccup" than getting an extra 2 FPS at the expense of consistency.
If all you want is to complete an operation as fast as possible, the Win7 LFH is quite fast, and is already doing this optimization for you (this tip is from the days back when the heap manager typically sucked and was really slow). That being said, I could be completely wrong - always profile your workload and see what works and what doesn't.
Generally it is best to let the memory manager do this kind of thing, but some extreme cases (e.g. LOTS of small allocations and deallocations) can be better handled by your own implementation: you grab one big chunk of memory and allocate/deallocate from it as required. Generally such cases are very simplified ones (e.g. your own sparse matrix implementation) where you can apply domain-specific optimizations that the standard memory manager cannot. For example, in the sparse matrix case every chunk of memory is the same size, which makes garbage collection relatively simple: chunks of deallocated memory never need to be joined, and a simple "in use" flag is all that's required (see the sketch below).
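A hedged sketch of that fixed-size approach (a singly linked free list threaded through the unused blocks; it assumes block_size is at least sizeof(void*) and suitably aligned):

#include <cstdlib>
#include <cstddef>

// Fixed-size pool: every block is the same size, so freed blocks are
// pushed onto a free list and handed back as-is. No splitting or
// coalescing of blocks is ever needed.
class FixedPool {
    union Node { Node* next; }; // a free block stores the link in itself
    char* storage_;
    Node* free_list_ = nullptr;
public:
    FixedPool(std::size_t block_size, std::size_t count)
        : storage_(static_cast<char*>(std::malloc(block_size * count))) {
        if (!storage_) return; // pool unusable if the upfront malloc fails
        for (std::size_t i = 0; i < count; ++i) {
            Node* n = reinterpret_cast<Node*>(storage_ + i * block_size);
            n->next = free_list_;
            free_list_ = n;
        }
    }
    ~FixedPool() { std::free(storage_); }

    void* allocate() {         // O(1): pop the free list
        if (!free_list_) return nullptr;
        Node* n = free_list_;
        free_list_ = n->next;
        return n;
    }
    void deallocate(void* p) { // O(1): push back onto the free list
        Node* n = static_cast<Node*>(p);
        n->next = free_list_;
        free_list_ = n;
    }
};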
You should only request from the memory manager as many blocks of memory as you need to be separately controllable (in an ideal world where we have infinite optimization time, of course). If you have several A objects that do not need separate lifetimes, do not allocate them separately.
Of course, whether or not this is actually worth the more intensive optimization time, is another question.
I was trying to look at the behavior of the new allocator and why it doesn't place data contiguously.
My code:
struct ci {
    char c;
    int i;
};

template <typename T>
void memTest()
{
    T* pLast = new T();
    for (int i = 0; i < 20; ++i) {
        T* pNew = new T();
        cout << (pNew - pLast) << " ";
        pLast = pNew;
    }
}
So I ran this with char, int, and ci. Most allocations were a fixed distance from the last; sometimes there were odd jumps from one available block to another.
sizeof(char) : 1
Average Jump: 64 bytes
sizeof(int): 4
Average Jump: 16
sizeof(ci): 8 (the int has to be placed on a 4-byte alignment)
Average Jump: 9
Can anyone explain why the allocator is fragmenting memory like this? Also, why is the jump for char so much larger than for ints and for a structure that contains both an int and a char?
There are two issues:
most allocators store some additional data prior to the start of the block (typically block size and a couple of pointers)
there are usually alignment requirements - modern operating systems typically allocate on at least an 8-byte boundary.
So you'll nearly always get some kind of gap between successive allocations.
Of course you should never rely on any specific behaviour for something like this, where the implementation is free to do as it pleases.
Your code contains a bug: to get the distance between pointers in bytes, cast them to (char *) first; otherwise the deltas are in units of sizeof(T).
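In other words, something along these lines (the same loop, with the deltas now in bytes; the absolute values remain implementation-dependent):

#include <iostream>
using std::cout;

template <typename T>
void memTest()
{
    T* pLast = new T();
    for (int i = 0; i < 20; ++i) {
        T* pNew = new T();
        // Cast to char* so the difference is measured in bytes rather
        // than in units of sizeof(T).
        cout << (reinterpret_cast<char*>(pNew) -
                 reinterpret_cast<char*>(pLast)) << " ";
        pLast = pNew;
    }
}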
This isn't fragmentation; it's just the allocator rounding up the size of your allocation to a round block size.
In general programming you should not pay attention to the pattern of memory addresses returned by general purpose allocators like new. When you do care about allocation behaviour you should always use a special purpose allocator (boost::pool, something you write yourself, etc.)
The exception is if you are studying allocators, in which case you could do worse than pick up your copy of K&R for a simple allocator which might help you understand how new gets its memory.
In general, you cannot depend on particular memory placement. The memory allocator's internal bookkeeping data and alignment requirements can both affect the placement of blocks. There is no requirement for blocks to be allocated contiguously.
Further, some systems will give you even "stranger" behavior. Many modern Linux systems have heap randomization enabled, where newly-allocated virtual memory pages are placed at random addresses to make certain types of security vulnerability exploits more difficult. With virtual memory, disparate allocated block addresses do not necessarily mean that the physical memory is fragmented, as there is no requirement for the virtual address space to be dense.
For small allocations, boost has a very simple allocator I've used called boost::simple_segregated_storage.
It maintains a singly linked list (slist) threaded through the free blocks, which are all the same size. As long as you only allocate in its fixed block size, you get no external fragmentation (though you can get some internal fragmentation if the block size is bigger than the requested size). Allocation and deallocation are O(1) when used this way. It's great for the small allocations common in template programming.
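For what it's worth, the higher-level boost::pool wrapper (built on simple_segregated_storage) looks roughly like this in use; treat this as a sketch and check the Boost.Pool documentation for the exact interface:

#include <boost/pool/pool.hpp>

int main() {
    // A pool handing out fixed-size chunks of sizeof(int) bytes.
    boost::pool<> p(sizeof(int));

    int* x = static_cast<int*>(p.malloc()); // O(1) allocation from the pool
    *x = 42;
    p.free(x);                              // O(1) return to the free list

    // Remaining pool memory is released when p is destroyed.
}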