How to implement a memory heap in C++

Wasn't exactly sure how to phrase the title, but the question is:
I've heard of programmers allocating a large section of contiguous memory at the start of a program and then dealing it out as necessary. This is in contrast to simply going to the OS every time memory is needed.
I've heard that this would be faster, because it avoids the cost of constantly asking the OS for contiguous blocks of memory.
I believe the JVM does just this, maintaining its own section of memory and then allocating objects from that.
My question is, how would one actually implement this?

Most C and C++ compilers already provide a heap memory-manager as part of the standard library, so you don't need to do anything at all in order to avoid hitting the OS with every request.
If you want to improve performance, there are a number of improved allocators around that you can simply link with and go. e.g. Hoard, which wheaties mentioned in a now-deleted answer (which actually was quite good -- wheaties, why'd you delete it?).
If you want to write your own heap manager as a learning exercise, here are the basic things it needs to do (a minimal sketch follows the list):
Request a big block of memory from the OS
Keep a linked list of the free blocks
When an allocation request comes in:
search the list for a block that's big enough for the requested size plus some book-keeping variables stored alongside.
split off a big enough chunk of the block for the current request, put the rest back in the free list
if no block is big enough, go back to the OS and ask for another big chunk
When a deallocation request comes in:
read the header to find out the size
add the newly freed block onto the free list
optionally, see if the memory immediately following is also listed on the free list, and combine both adjacent blocks into one bigger one (called coalescing the heap)
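A minimal sketch of those steps in C++ (first-fit search, block splitting, and a size header; alignment, error handling, and coalescing are left out for brevity, and all names are illustrative):

#include <cstddef>
#include <cstdlib>

// Every block starts with a header; free blocks are chained into a list.
struct BlockHeader {
    std::size_t size;   // payload size in bytes
    BlockHeader* next;  // next free block (meaningful only while free)
};

static BlockHeader* freeList = nullptr;

// Request one big chunk up front (malloc stands in for a real OS call
// such as mmap or VirtualAlloc in this sketch).
void initHeap(std::size_t bytes) {
    freeList = static_cast<BlockHeader*>(std::malloc(bytes));
    freeList->size = bytes - sizeof(BlockHeader);
    freeList->next = nullptr;
}

void* myAlloc(std::size_t n) {
    for (BlockHeader** prev = &freeList, *b = freeList; b; prev = &b->next, b = b->next) {
        if (b->size < n) continue;                      // first fit
        if (b->size >= n + sizeof(BlockHeader) + 16) {  // worth splitting
            BlockHeader* rest =
                reinterpret_cast<BlockHeader*>(reinterpret_cast<char*>(b + 1) + n);
            rest->size = b->size - n - sizeof(BlockHeader);
            rest->next = b->next;
            b->size = n;
            *prev = rest;        // remainder goes back on the free list
        } else {
            *prev = b->next;     // take the whole block
        }
        return b + 1;            // payload sits right after the header
    }
    return nullptr;  // real code would ask the OS for another big chunk here
}

void myFree(void* p) {
    BlockHeader* b = static_cast<BlockHeader*>(p) - 1;  // read the header
    b->next = freeList;          // push onto the free list
    freeList = b;                // (coalescing of adjacent blocks omitted)
}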

You allocate a chunk of memory at the beginning of the program, large enough to sustain its needs. Then you override new and/or malloc, and delete and/or free, to return memory from/to this buffer.
When implementing this kind of solution, you need to write your own allocator (to source from the chunk), and you may end up using more than one allocator, which is often why you allocate a memory pool in the first place.
The default memory allocator is a good all-around allocator, but it is not the best for all allocation needs. For example, if you know you'll be allocating many objects of a particular size, you may define an allocator that hands out fixed-size buffers, and pre-allocate more than one to gain some efficiency.
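To make the overriding part concrete, here is a deliberately tiny bump-allocator version of a global operator new/delete serving from a pre-allocated buffer. It is a sketch only: the buffer size is arbitrary, delete reclaims nothing, and a real replacement would also need the array and nothrow forms plus thread safety.

#include <cstddef>
#include <new>

alignas(std::max_align_t) static char buffer[1 << 20];  // reserved at start-up
static std::size_t offset = 0;

void* operator new(std::size_t sz) {
    // round up so every returned block stays suitably aligned
    sz = (sz + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
    if (offset + sz > sizeof(buffer)) throw std::bad_alloc();
    void* p = buffer + offset;
    offset += sz;
    return p;
}

void operator delete(void* p) noexcept {
    (void)p;  // toy version: individual blocks are never reclaimed
}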

Here is the classic allocator, and one of the best for non-multithreaded use:
http://gee.cs.oswego.edu/dl/html/malloc.html
You can learn a lot from reading the explanation of its design. The link to malloc.c in the article is rotted; it can now be found at http://gee.cs.oswego.edu/pub/misc/malloc.c.
With that said, unless your program has really unusual allocation patterns, it's probably a very bad idea to write your own allocator or use a custom one. Especially if you're trying to replace the system malloc, you risk all kinds of bugs and compatibility issues from different libraries (or standard library functions) getting linked to the "wrong version of malloc".
If you find yourself needing specialized allocation for just a few specific tasks, that can be done without replacing malloc. I would recommend looking up GNU obstack and object pools for fixed-sized objects. These cover a majority of the cases where specialized allocation might have real practical usefulness.

Yes, both the stdlib heap and the OS heap / virtual memory are pretty troublesome. OS calls are really slow, and the stdlib is faster but still has some "unnecessary" locks and checks, and adds significant overhead to allocated blocks (i.e. some memory is used for management, in addition to what you allocate).
In many cases it's possible to avoid dynamic allocation completely by using static structures instead. For example, sometimes it's better (safer, etc.) to define a 64 KB static buffer for a Unicode filename than to define a pointer/std::string and dynamically allocate it.
When the program has to allocate many instances of the same structure, it's much faster to allocate large memory blocks and then just store the instances there (sequentially or using a linked list of free nodes); C++ has "placement new" for that.
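For illustration, placement new constructs an object at an address you already own; the Particle type here is made up:

#include <cstdlib>
#include <new>

struct Particle { float x, y, z; };

void demo() {
    // one big block holds many instances; no per-object heap calls
    void* block = std::malloc(1000 * sizeof(Particle));
    Particle* p = new (block) Particle{1.0f, 2.0f, 3.0f};  // construct in place
    p->~Particle();    // destroy explicitly; no delete on placement-new'd objects
    std::free(block);  // the block is released once, as a whole
}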
In many cases, when working with variable-size objects, the set of possible sizes is actually very limited (e.g. something like 4+2*(1..256)), so it's possible to use a few pools like [3] without having to collect garbage, fill the gaps, etc.
It's common for a custom allocator for a specific task to be much faster than the one(s) from the standard library, and even faster than speed-optimized but overly universal implementations.
Modern CPUs/OSes support "large pages", which can significantly improve memory access speed when you explicitly work with large blocks; see http://7-max.com/

IBM developerWorks has a nice article about memory management, with an extensive resources section for further reading: Inside memory management.
Wikipedia has some good information as well: C dynamic memory allocation, Memory management.

Related

Efficient way to truncate a std::vector<char> to length N - to free memory

I have several large std::vectors of chars (bytes loaded from binary files).
When my program runs out of memory, I need to be able to cull some of the memory used by these vectors. These vectors are almost the entirety of my memory usage, and they're just caches for local and network files, so it's safe to just grab the largest one and chop it in half or so.
Only thing is, I'm currently using vector::resize and vector::shrink_to_fit, but this seems to require more memory (I imagine for a reallocation of the new size) and then a bunch of time (for destruction of the now-destroyed pointers, which I thought would be free?) and then copying the remaining elements to the new vector. Note, this is on the Windows platform, in debug, so the pointers might not be destroyed in the Release build or on other platforms.
Is there something I can do to just say "C++, please tell the OS that I no longer need the memory located past location N in this vector"?
Alternatively, is there another container I'd be better off using? I do need to have random access though, or put effort into designing a way to easily keep iterators pointing at the place I'll want to read next, which would be possible, just not easy, so I'd prefer not using a std::list.
resize and shrink_to_fit are your best bets, as long as we are talking about standard C++, but these, as you noticed, may not help at all if you are in a low-memory situation to begin with: since the allocator interface does not provide a realloc-like operation, vector is forced to allocate a new block, copy the data into it, and deallocate the old block.
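For reference, the standard idiom looks like this (the pre-C++11 swap trick is shown as a comment; note that both variants may allocate a new block before releasing the old one, which is exactly the problem in a low-memory situation):

#include <cstddef>
#include <vector>

void truncate(std::vector<char>& v, std::size_t n) {
    v.resize(n);        // drop elements past n
    v.shrink_to_fit();  // non-binding request to reduce capacity() to size()

    // pre-C++11 equivalent: copy-and-swap into a right-sized vector
    // std::vector<char>(v.begin(), v.end()).swap(v);
}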
Now, I see essentially four easy ways out:
drop whole vectors, not just parts of them, possibly using an LRU or something like that; when working with big vectors, the C++ allocator normally just forwards to the OS's memory-management calls, so the memory should go right back to the OS;
write your own container which uses malloc/realloc, or OS-specific functionality;
use std::deque instead; you lose guaranteed contiguity of the data, but, since deques normally allocate the space for data in distinct chunks, doing a resize + shrink_to_fit should be quite cheap: all the unused blocks at the end are simply freed, with no need for massive reallocations;
just leave this job to the OS. As already stated in the comments, the OS already has a file cache, and in normal cases it can handle it better than you or me, even just because it has a better view of how much physical memory is left, which files are "hot" for most applications, and so on. Also, since you are working in a virtual address space, you cannot even guarantee that your memory will actually stay in RAM; the moment the machine comes under memory pressure, pages you aren't using often get swapped to disk, so all your performance gain is lost (and you've wasted space in the paging file for stuff that is already found on disk).
An additional way may be to just use memory mapped files - the system will do its own caching as usual, but you avoid any syscall overhead as long as the file remains in memory.
std::vector::shrink_to_fit() cannot result in more memory being used; if it does, that's a bug.
C++11 defines shrink_to_fit() as follows:
void shrink_to_fit(); Remarks: shrink_to_fit is a non-binding request to reduce capacity() to size(). [ Note: The request is non-binding to allow latitude for implementation-specific optimizations. — end note ]
As the note indicates, shrink_to_fit() may, but need not, actually free memory; the standard gives C++ implementations a free hand to recycle and optimize memory usage internally, as they see fit. C++ does not make it mandatory for shrink_to_fit(), and the like, to result in memory actually being released to the operating system, and in many cases the C++ runtime library may not be able to, as I'll get to in a moment. The C++ runtime library is allowed to take the freed memory, stash it away internally, and reuse it automatically for future memory allocation requests (explicit news, or container growth).
Most modern operating systems are not designed to allocate and release memory blocks of arbitrary sizes. Details differ, but typically an operating system allocates and deallocates memory in even chunks, typically 4 KB or larger, at page-aligned addresses. If you allocate a new object that's only a few hundred bytes long, the C++ library will request an entire page of memory, take the first hundred bytes of it for the new object, then keep the spare memory for future new requests.
Similarly, even if shrink_to_fit(), or delete, frees up a few hundred bytes, that memory can't go back to the operating system immediately, but only once an entire contiguous 4 KB memory range (or whatever the allocation page size used by the operating system is), suitably aligned, is completely unused. Only then can the process release that page back to the operating system. Until then, the library keeps track of freed memory ranges, to be used for future new requests, without asking the operating system to allocate more pages of memory for the process.

Dealing with fragmentation in a memory pool?

Suppose I have a memory pool object with a constructor that takes a pointer to a large chunk of memory ptr and size N. If I do many random allocations and deallocations of various sizes I can get the memory in such a state that I cannot allocate an M byte object contiguously in memory even though there may be a lot free! At the same time, I can't compact the memory because that would cause a dangling pointer on the consumers. How does one resolve fragmentation in this case?
I want to add my 2 cents only because no one else pointed out that, from your description, it sounds like you are implementing a standard heap allocator (i.e. what all of us already use every time we call malloc() or operator new).
A heap is exactly such an object: it goes to the virtual memory manager and asks for a large chunk of memory (what you call "a pool"). Then it has all kinds of algorithms for allocating and freeing chunks of various sizes in the most efficient way. Furthermore, many people have modified and optimized these algorithms over the years. For a long time Windows shipped with an option called the low-fragmentation heap (LFH), which you used to have to enable manually. Starting with Vista, the LFH is used for all heaps by default.
Heaps are not perfect, and they can definitely bog down performance when not used properly. Since OS vendors can't possibly anticipate every scenario in which you will use a heap, their heap managers have to be optimized for "average" use. But if you have a requirement that is similar to the requirements for a regular heap (i.e. many objects, different sizes...), you should consider just using a heap and not reinventing it, because chances are your implementation will be inferior to what the OS already provides for you.
With memory allocation, the only time you can gain performance by not simply using the heap is by giving up some other aspect (allocation overhead, allocation lifetime....) which is not important to your specific application.
For example, in our application we had a requirement for many allocations of less than 1KB but these allocations were used only for very short periods of time (milliseconds). To optimize the app, I used Boost Pool library but extended it so that my "allocator" actually contained a collection of boost pool objects, each responsible for allocating one specific size from 16 bytes up to 1024 (in steps of 4). This provided almost free (O(1) complexity) allocation/free of these objects but the catch is that a) memory usage is always large and never goes down even if we don't have a single object allocated, b) Boost Pool never frees the memory it uses (at least in the mode we are using it in) so we only use this for objects which don't stick around very long.
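A rough sketch of that bucketing idea, written here without Boost so it stays self-contained (size classes in steps of 16 bytes for alignment, rather than the steps of 4 we used; freed memory is kept forever, matching drawback (a) above):

#include <cstddef>
#include <cstdlib>

// One intrusive free list per size class: 16, 32, ..., 1024 bytes.
constexpr std::size_t kStep = 16, kMax = 1024;
static void* freeLists[kMax / kStep] = {};

static std::size_t bucket(std::size_t sz) {  // valid for 1..kMax
    return (sz + kStep - 1) / kStep - 1;
}

void* poolAlloc(std::size_t sz) {
    std::size_t b = bucket(sz);
    if (void* p = freeLists[b]) {             // O(1): pop from the free list
        freeLists[b] = *static_cast<void**>(p);
        return p;
    }
    return std::malloc((b + 1) * kStep);      // grow; never returned to the OS
}

void poolFree(void* p, std::size_t sz) {      // caller passes the alloc size
    std::size_t b = bucket(sz);
    *static_cast<void**>(p) = freeLists[b];   // O(1): push onto the free list
    freeLists[b] = p;
}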
So which aspect(s) of normal memory allocation are you willing to give up in your app?
Depending on the system there are a couple of ways to do it.
Try to avoid fragmentation in the first place: if you allocate blocks in powers of 2 (the rounding is sketched below), you have less of a chance of causing this kind of fragmentation. There are a couple of other ways around it, but if you ever reach this state you just OOM at that point, because there is no graceful way of handling it other than killing the process that asked for memory, blocking until you can allocate memory, or returning NULL as your allocation area.
Another way is to pass pointers to pointers to your data (e.g. int**). Then you can rearrange memory beneath the program (thread-safely, I hope) and compact the allocations so that you can allocate new blocks and still keep the data from old blocks (once the system gets to this state, though, that becomes a heavy overhead and should seldom be done).
There are also ways of "binning" memory so that you have contiguous pages: for instance, dedicate one page only to allocations of 512 bytes or less, another to 1024 bytes or less, etc. This makes it easier to decide which bin to use, and in the worst case you split from the next-highest bin or merge from a lower bin, which reduces the chance of fragmenting across multiple pages.
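For the powers-of-2 idea, the rounding itself is a cheap, well-known bit trick (this version assumes a 64-bit std::size_t and sz > 0):

#include <cstddef>

// round sz up to the next power of two
std::size_t roundUpPow2(std::size_t sz) {
    sz--;
    sz |= sz >> 1;  sz |= sz >> 2;   sz |= sz >> 4;
    sz |= sz >> 8;  sz |= sz >> 16;  sz |= sz >> 32;
    return sz + 1;
}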
Implementing object pools for the objects that you frequently allocate will drive fragmentation down considerably without the need to change your memory allocator.
It would be helpful to know more exactly what you are actually trying to do, because there are many ways to deal with this.
But, the first question is: is this actually happening, or is it a theoretical concern?
One thing to keep in mind is you normally have a lot more virtual memory address space available than physical memory, so even when physical memory is fragmented, there is still plenty of contiguous virtual memory. (Of course, the physical memory is discontiguous underneath but your code doesn't see that.)
I think there is sometimes unwarranted fear of memory fragmentation, and as a result people write a custom memory allocator (or worse, they concoct a scheme with handles and moveable memory and compaction). I think these are rarely needed in practice, and it can sometimes improve performance to throw this out and go back to using malloc.
Write the pool to operate as a list of allocations; you can then extend and destroy it as needed. This can reduce fragmentation.
And/or implement allocation transfer (or move) support so you can compact active allocations. The object/holder may need to assist you, since the pool may not necessarily know how to transfer types itself. If the pool is used with a collection type, then it is far easier to accomplish compacting/transfers.

questions about memory pool

I need some clarifications for the concept & implementation on memory pool.
By memory pool on wiki, it says that
also called fixed-size-blocks allocation, ... ,
as those implementations suffer from fragmentation because of variable
block sizes, it can be impossible to use them in a real time system
due to performance.
How "variable block size causes fragmentation" happens? How fixed sized allocation can solve this? This wiki description sounds a bit misleading to me. I think fragmentation is not avoided by fixed sized allocation or caused by variable size. In memory pool context, fragmentation is avoided by specific designed memory allocators for specific application, or reduced by restrictly using an intended block of memory.
Also by several implementation samples, e.g., Code Sample 1 and Code Sample 2, it seems to me, to use memory pool, the developer has to know the data type very well, then cut, split, or organize the data into the linked memory chunks (if data is close to linked list) or hierarchical linked chunks (if data is more hierarchical organized, like files). Besides, it seems the developer has to predict in prior how much memory he needs.
Well, I could imagine this works well for an array of primitive data. What about C++ non-primitive data classes, in which the memory model is not that evident? Even for primitive data, should the developer consider the data type alignment?
Is there good memory pool library for C and C++?
Thanks for any comments!
Variable block size indeed causes fragmentation. Look at the picture I am attaching:
The image (from here) shows a situation in which A, B, and C allocate variable-sized chunks of memory.
At some point, B frees all its chunks of memory, and suddenly you have fragmentation. E.g., if C needed to allocate a large chunk of memory that would still fit into the total available memory, it could not do so, because the available memory is split into two blocks.
Now, if you think about the case where each chunk of memory would be of the same size, this situation would clearly not arise.
Memory pools, of course, have their own drawbacks, as you yourself point out. So you should not think that a memory pool is a magical wand. It has a cost and it makes sense to pay it under specific circumstances (i.e., embedded system with limited memory, real time constraints and so on).
As to which memory pool is good in C++, I would say that it depends. I have used one under VxWorks that was provided by the OS; in a sense, a good memory pool is effective when it is tightly integrated with the OS. Actually each RTOS offers an implementation of memory pools, I guess.
If you are looking for a generic memory pool implementation, look at this.
EDIT:
From your last comment, it seems to me that you are possibly thinking of memory pools as "the" solution to the problem of fragmentation. Unfortunately, this is not the case. If you want, fragmentation is the manifestation of entropy at the memory level, i.e., it is inevitable. On the other hand, memory pools are a way to manage memory so as to effectively reduce the impact of fragmentation (as I said, and as Wikipedia mentioned, mostly on specific systems like real-time systems). This comes at a cost, since a memory pool can be less efficient than a "normal" memory allocation technique in that you have a minimum block size. In other words, the entropy reappears in disguise.
Furthermore, there are many parameters that affect the efficiency of a memory pool system, like block size, block allocation policy, or whether you have just one memory pool or several memory pools with different block sizes, different lifetimes, or different policies.
Memory management is really a complex matter, and memory pools are just a technique that, like any other, improves things in comparison to other techniques and exacts a cost of its own.
In a scenario where you always allocate fixed-size blocks, you either have enough space for one more block, or you don't. If you have, the block fits in the available space, because all free or used spaces are of the same size. Fragmentation is not a problem.
In a scenario with variable-size blocks, you can end up with multiple separate free blocks with varying sizes. A request for a block of a size that is less than the total memory that is free may be impossible to be satisfied, because there isn't one contiguous block big enough for it. For example, imagine you end up with two separate free blocks of 2KB, and need to satisfy a request for 3KB. Neither of these blocks will be enough to provide for that, even though there is enough memory available.
Both fixed-size and variable-size memory pools will feature fragmentation, i.e. there will be some free memory chunks between used ones.
For variable size, this might cause problems, since there might not be a free chunk that is big enough for a certain requested size.
For fixed-size pools, on the other hand, this is not a problem, since only portions of the pre-defined size can be requested. If there is free space, it is guaranteed to be large enough for (a multiple of) one portion.
If you are building a hard real-time system, you might need to know in advance that you can allocate memory within the maximum time allowed. That can be "solved" with fixed-size memory pools.
I once worked on a military system, where we had to calculate the maximum possible number of memory blocks of each size that the system could ever possibly use. Then those numbers were added to a grand total, and the system was configured with that amount of memory.
Crazily expensive, but worked for the defence.
When you have several fixed size pools, you can get a secondary fragmentation where your pool is out of blocks even though there is plenty of space in some other pool. How do you share that?
With a memory pool, operations might work like this (see the sketch after this list):
Store a global variable that is a list of available objects (initially empty).
To get a new object, try to return one from the global list of available objects. If there isn't one, call operator new to allocate a new object on the heap. Pool allocation is extremely fast, which is important for applications that might otherwise spend a lot of CPU time on memory allocations.
To free an object, simply add it to the global list of available objects. You might place a cap on the number of items allowed in the global list; if the cap is reached then the object would be freed instead of returned to the list. The cap prevents the appearance of a massive memory leak.
Note that this is always done for a single data type of the same size; it doesn't work for varying sizes, and then you probably need to use the heap as usual.
It's very easy to implement; we use this strategy in our application. It causes a bunch of memory allocations at the beginning of the program, but after that no more freeing/allocating occurs, avoiding the significant overhead those operations incur.
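A minimal sketch of those three operations for one type (names are illustrative; note the reused object comes back with its old state, so callers reset it, or one could run its destructor and re-construct with placement new):

#include <cstddef>
#include <vector>

template <typename T>
class ObjectPool {
    std::vector<T*> available;                 // global list of available objects
    static constexpr std::size_t kCap = 1024;  // cap prevents a "leak" in disguise
public:
    T* acquire() {
        if (!available.empty()) {              // reuse if possible
            T* p = available.back();
            available.pop_back();
            return p;
        }
        return new T();                        // otherwise fall back to the heap
    }
    void release(T* p) {
        if (available.size() < kCap)
            available.push_back(p);            // keep it around for reuse
        else
            delete p;                          // cap reached: really free it
    }
};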

Which memory allocation algorithm suits best for performance and time critical c++ applications?

I ask this question to determine which memory allocation algorithm gives better results for performance-critical applications, like game engines or embedded applications. The results actually depend on the percentage of fragmented memory and the time-determinism of memory requests.
There are several algorithms in the textbooks (e.g. buddy memory allocation), but there are also others, like TLSF. So, of the memory allocation algorithms available, which one is fastest and causes the least fragmentation? BTW, garbage collectors should not be included.
Please also note that this question is not about profiling; it just aims to find the optimum algorithm for the given requirements.
It all depends on the application. Server applications which can clear out all memory relating to a particular request at defined moments will have a different memory access pattern than video games, for instance.
If there was one memory allocation algorithm that was always best for performance and fragmentation, wouldn't the people implementing malloc and new always choose that algorithm?
Nowadays, it's usually best to assume that the people who wrote your operating system and runtime libraries weren't brain dead; and unless you have some unusual memory access pattern don't try to beat them.
Instead, try to reduce the number of allocations (or reallocations) you make. For instance, I often use a std::vector, but if I know ahead of time how many elements it will have, I can reserve that all in one go. This is much more efficient than letting it grow "naturally" through several calls to push_back().
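A minimal illustration:

#include <vector>

void fill() {
    std::vector<int> v;
    v.reserve(1000);             // one allocation up front
    for (int i = 0; i < 1000; ++i)
        v.push_back(i);          // no reallocations or copies while growing
}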
Many people coming from languages where new just means "gimme an object" will allocate things for no good reason. If you don't have to put it on the heap, don't call new.
As for fragmentation: it still depends. Unfortunately I can't find the link now, but I remember a blog post from somebody at Microsoft who had worked on a C++ server application that suffered from memory fragmentation. The team solved the problem by allocating memory from two regions. Memory for all requests would come from region A until it was full (requests would free memory as normal). When region A was full, all memory would be allocated from region B. By the time region B was full, region A was completely empty again. This solved their fragmentation problem.
Will it solve yours? I have no idea. Are you working on a project which services several independent requests? Are you working on a game?
As for determinism: it still depends. What is your deadline? What happens when you miss the deadline (astronauts lost in space? the music being played back starts to sound like garbage?)? There are real time allocators, but remember: "real time" means "makes a promise about meeting a deadline," not necessarily "fast."
I did just come across a post describing various things Facebook has done to both speed up and reduce fragmentation in jemalloc. You may find that discussion interesting.
Barış:
Your question is very general, but here's my answer/guidance:
I don't know about game engines, but for embedded and real-time applications, the general goals of an allocation algorithm are:
1- Bounded execution time: You have to know in advance the worst case allocation time so you can plan your real time tasks accordingly.
2- Fast execution: Well, the faster the better, obviously
3- Always allocate: Especially for real-time, security critical applications, all requests must be satisfied. If you request some memory space and get a null pointer: trouble!
4- Reduce fragmentation: Although this depends on the algorithm used, generally, less fragmented allocations provide better performance, due to a number of reasons, including caching effects.
In most critical systems, you are not allowed to dynamically allocate any memory to begin with. You analyze your requirements, determine your maximum memory use, and allocate a large chunk of memory as soon as your application starts. If you can't, the application does not even start; if it does start, no new memory blocks are allocated during execution.
If speed is a concern, I'd recommend following a similar approach. You can implement a memory pool which manages your memory. The pool could initialize a "sufficient" block of memory in the start of your application and serve your memory requests from this block. If you require more memory, the pool can do another -probably large- allocation (in anticipation of more memory requests), and your application can start using this newly allocated memory. There are various memory pooling schemes around as well, and managing these pools is another whole topic.
As for some examples: VxWorks RTOS used to employ a first-fit allocation algorithm where the algorithm analyzed a linked list to find a big enough free block. In VxWorks 6, they're using a best-fit algorithm, where the free space is kept in a tree and allocations traverse the tree for a big enough free block. There's a white paper titled Memory Allocation in VxWorks 6.0, by Zoltan Laszlo, which you can find by Googling, that has more detail.
Going back to your question about speed/fragmentation: It really depends on your application. Things to consider are:
Are you going to make lots of very small allocations, or relatively larger ones?
Will the allocations come in bursts, or spread equally throughout the application?
What is the lifetime of the allocations?
If you're asking this question because you're going to implement your own allocator, you should probably design it in such a way that you can change the underlying allocation/deallocation algorithm, because if the speed/fragmentation is really that critical in your application, you're going to want to experiment with different allocators. If I were to recommend something without knowing any of your requirements, I'd start with TLSF, since it has good overall characteristics.
As others already wrote, there is no "optimum algorithm" for every possible application. It has already been proven that for any possible algorithm you can find an allocation sequence which will cause fragmentation.
Below I write a few hints from my game development experience:
Avoid allocations if you can
A common practice in the game development field was (and to a certain extent still is) to solve dynamic memory allocation performance issues by avoiding memory allocation like the plague. It is quite often possible to use stack-based memory instead: even for dynamic arrays, you can often come up with an estimate that will cover 99% of cases, and you need to allocate only when you are over this boundary. Another commonly used approach is "preallocation": estimate how much memory you will need in some function or for some object, create a kind of small and simplistic "local heap" you allocate up front, and perform the individual allocations from this heap only.
Memory allocator libraries
Another option is to use one of the memory allocation libraries. They are usually created by experts in the field to fit some special requirements, and if your requirements are similar, they may fit your needs.
Multithreading
There is one particular case in which you will find the "default" OS/CRT allocator performs badly, and that is multithreading. If you are targeting Windows, be aware that both the OS and CRT allocators provided by Microsoft (including the otherwise excellent Low Fragmentation Heap) are currently blocking. If you want to do significant threading, you need either to reduce allocation as much as possible, or to use one of the alternatives. See Can multithreading speed up memory allocation?
The best practice is: use whatever you can to get the thing done in time (in your case, the default allocator). If the whole thing is very complex, write tests and samples that emulate parts of the whole thing. Then run performance tests and benchmarks to find bottlenecks (they will probably have nothing to do with memory allocation :).
From this point you will see what exactly slows down your code and why. Only with such precise knowledge can you ever optimize something and choose one algorithm over another. Without tests it's just a waste of time, since you can't even measure how much your optimization will speed up your app (in fact, such "premature" optimization can really slow it down).
Memory allocation is a very complex thing, and it really depends on many factors. For example, this allocator is simple and damn fast but can be used only in a limited number of situations:
#include <cstddef>  // std::size_t

char pool[MAX_MEMORY_REQUIRED_TO_RENDER_FRAME];  // one frame's worth of memory
char *poolHead = pool;

// bump allocation: no bounds check, no alignment, no per-object free
void *alloc(std::size_t sz) { char *p = poolHead; poolHead += sz; return p; }
void freeAll() { poolHead = pool; }  // the whole pool is reset at once
So there is no "the best algorithm ever".
One constraint that's worth mentioning, and which has not been mentioned yet, is multithreading: standard allocators must be implemented to support several threads, all allocating/deallocating concurrently, and passing objects from one thread to another so that they get deallocated by a different thread.
As you may have guessed from that description, it is a tricky task to implement an allocator that handles all of this well. And it does cost performance, as it is impossible to satisfy all these constraints without inter-thread communication (i.e. the use of atomic variables and locks), which is quite costly.
As such, if you can avoid concurrency in your allocations, you stand a good chance to implement your own allocator that significantly outperforms the standard allocators: I once did this myself, and it saved me roughly 250 CPU cycles per allocation with a fairly simple allocator that's based on a number of fixed sized memory pools for small objects, stacking free objects with an intrusive linked list.
Of course, avoiding concurrency is likely a no-go for you, but if you don't use it anyway, exploiting that fact might be something worth thinking about.

Any reason to overload global new and delete?

Unless you're programming parts of an OS or an embedded system, are there any reasons to do so? I can imagine that for some particular classes that are created and destroyed frequently, overloading memory management functions or introducing a pool of objects might lower the overhead, but doing these things globally?
Addition
I've just found a bug in an overloaded delete function: memory wasn't always freed. And that was in a not-so-memory-critical application. Also, disabling these overloads decreases performance by only ~0.5%.
We overload the global new and delete operators where I work for many reasons:
pooling all small allocations -- decreases overhead, decreases fragmentation, can increase performance for small-alloc-heavy apps
framing allocations with a known lifetime -- ignore all the frees until the very end of this period, then free all of them together (admittedly we do this more with local operator overloads than global)
alignment adjustment -- to cacheline boundaries, etc
alloc fill -- helping to expose usage of uninitialized variables
free fill -- helping to expose usage of previously deleted memory
delayed free -- increasing the effectiveness of free fill, occasionally increasing performance
sentinels or fenceposts -- helping to expose buffer overruns, underruns, and the occasional wild pointer
redirecting allocations -- to account for NUMA, special memory areas, or even to keep separate systems separate in memory (for e.g. embedded scripting languages or DSLs)
garbage collection or cleanup -- again useful for those embedded scripting languages
heap verification -- you can walk through the heap data structure every N allocs/frees to make sure everything looks ok
accounting, including leak tracking and usage snapshots/statistics (stacks, allocation ages, etc)
The idea of new/delete accounting is really flexible and powerful: you can, for example, record the entire callstack for the active thread whenever an alloc occurs, and aggregate statistics about that. You could ship the stack info over the network if you don't have space to keep it locally for whatever reason. The types of info you can gather here are only limited by your imagination (and performance, of course).
We use global overloads because it's convenient to hang lots of common debugging functionality there, as well as make sweeping improvements across the entire app, based on the statistics we gather from those same overloads.
We still do use custom allocators for individual types too; in many cases the speedup or capabilities you can get by providing custom allocators for e.g. a single point-of-use of an STL data structure far exceeds the general speedup you can get from the global overloads.
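For a flavor of what such a global overload looks like, here is a bare-bones sketch doing only alloc fill and a running byte count. It is illustrative, not production code: a real version also handles the array and nothrow forms, and pads the header to alignof(std::max_align_t) instead of weakening alignment as this one does.

#include <atomic>
#include <cstdlib>
#include <cstring>
#include <new>

static std::atomic<std::size_t> liveBytes{0};  // simple usage statistic

void* operator new(std::size_t sz) {
    // stash the size in a small header so delete can account for it
    void* raw = std::malloc(sizeof(std::size_t) + sz);
    if (!raw) throw std::bad_alloc();
    *static_cast<std::size_t*>(raw) = sz;
    void* user = static_cast<std::size_t*>(raw) + 1;
    std::memset(user, 0xCD, sz);   // alloc fill: expose reads of uninitialized memory
    liveBytes += sz;
    return user;
}

void operator delete(void* p) noexcept {
    if (!p) return;
    std::size_t* raw = static_cast<std::size_t*>(p) - 1;
    liveBytes -= *raw;
    std::memset(p, 0xDD, *raw);    // free fill: expose use-after-free
    std::free(raw);
}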
Take a look at some of the allocators and debugging systems that are out there for C/C++ and you'll rapidly come up with these and other ideas:
valgrind
electricfence
dmalloc
dlmalloc
Application Verifier
Insure++
BoundsChecker
...and many others... (the gamedev industry is a great place to look)
(One old but seminal book is Writing Solid Code, which discusses many of the reasons you might want to provide custom allocators in C, most of which are still very relevant.)
Obviously if you can use any of these fine tools you will want to do so rather than rolling your own.
There are situations in which it is faster, easier, less of a business/legal hassle, nothing's available for your platform yet, or just more instructive: dig in and write a global overload.
The most common reason to overload new and delete are simply to check for memory leaks, and memory usage stats. Note that "memory leak" is usually generalized to memory errors. You can check for things such as double deletes and buffer overruns.
The uses after that are usually memory-allocation schemes, such as garbage collection, and pooling.
All other cases are just specific things, mentioned in other answers (logging to disk, kernel use).
In addition to the other important uses mentioned here, like memory tagging, it's also the only way to force all allocations in your app to go through fixed-block allocation, which has enormous implications for performance and fragmentation.
For example, you may have a series of memory pools with fixed block sizes. Overriding global new lets you direct all 61-byte allocations to, say, the pool with 64-byte blocks, all 768-1024-byte allocs to the 1024-byte-block pool, all those above that to the 2048-byte-block pool, and anything larger than 8 KB to the general ragged heap.
Because fixed-block allocators are much faster and less prone to fragmentation than allocating willy-nilly from the heap, this lets you force even crappy 3rd-party code to allocate from your pools and not poop all over the address space.
This is done often in systems which are time- and space-critical, such as games. 280Z28, Meeh, and Dan Olson have described why.
UnrealEngine3 overloads global new and delete as part of its core memory management system. There are multiple allocators that provide different features (profiling, performance, etc.) and they need all allocations to go through it.
Edit: For my own code, I would only ever do it as a last resort. And by that I mean I would almost positively never use it. But my personal projects are obviously much smaller/very different requirements.
Some realtime systems overload them to prevent them from being used after init.
Overloading new & delete makes it possible to add a tag to your memory allocations. I tag allocations per system or control or by middleware. I can view, at runtime, how much each uses. Maybe I want to see the usage of a parser separated from the UI or how much a piece of middleware is really using!
You can also use it to put guard bands around the allocated memory. If/when your app crashes you can take a look at the address. If you see the contents as "0xABCDABCD" (or whatever you choose as guard) you are accessing memory you don't own.
Perhaps after calling delete you can fill this space with a similarly recognizable pattern.
I believe VisualStudio does something similar in debug. Doesn't it fill uninitialized memory with 0xCDCDCDCD?
Finally, if you have fragmentation issues you could use it to redirect to a block allocator? I am not sure how often this is really a problem.
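A sketch of the guard-band idea (alignment and failure handling omitted; the 0xABCDABCD value is just the recognizable pattern mentioned above):

#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

constexpr std::uint32_t kGuard = 0xABCDABCD;

// allocate n bytes with a guard word on each side
void* guardedAlloc(std::size_t n) {
    char* raw = static_cast<char*>(std::malloc(n + 2 * sizeof(kGuard)));
    std::memcpy(raw, &kGuard, sizeof(kGuard));                       // front fence
    std::memcpy(raw + sizeof(kGuard) + n, &kGuard, sizeof(kGuard));  // back fence
    return raw + sizeof(kGuard);
}

void guardedFree(void* p, std::size_t n) {
    char* raw = static_cast<char*>(p) - sizeof(kGuard);
    std::uint32_t front, back;
    std::memcpy(&front, raw, sizeof(front));
    std::memcpy(&back, raw + sizeof(kGuard) + n, sizeof(back));
    assert(front == kGuard && back == kGuard);  // fires on an under- or overrun
    std::free(raw);
}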
You need to overload them when the call to new and delete doesn't work in your environment.
For example, in kernel programming, the default new and delete don't work as they rely on user mode library to allocate memory.
From a practical standpoint it may just be better to override malloc on a system library level, since operator new will probably be calling it anyway.
On Linux, you can put your own version of malloc in place of the system one, as in this example:
http://developers.sun.com/solaris/articles/lib_interposers.html
In that article, they are trying to collect performance statistics. But you may also detect memory leaks if you also override free.
Since you are doing this in a shared library with LD_PRELOAD, you don't even need to recompile your application.
I've seen it done in a system that for 'security'* reasons was required to write over all memory it used on deallocation. The approach was to allocate an extra few bytes at the start of each block of memory to hold the size of the overall block, so that the whole block could be overwritten with zeros on delete.
This had a number of problems as you can probably imagine but it did work (mostly) and saved the team from reviewing every single memory allocation in a reasonably large, existing application.
Certainly not saying that it is a good use but it is probably one of the more imaginative ones out there...
* sadly it wasn't so much about actual security as the appearance of security...
Photoshop plugins written in C++ should override operator new so that they obtain memory via Photoshop.
I've done it with memory mapped files so that data written to the memory is automatically also saved to disk.
It's also used to return memory at a specific physical address if you have memory mapped IO devices, or sometimes if you need to allocate a certain block of contiguous memory.
But 99% of the time it's done as a debugging feature, to log how often, where, and when memory is being allocated and released.
It's actually pretty common for games to allocate one huge chunk of memory from the system and then provide custom allocators via overloaded new and delete. One big reason is that consoles have a fixed memory size, making both leaks and fragmentation large problems.
Usually (at least on a closed platform) the default heap operations come with a lack of control and a lack of introspection. For many applications this doesn't matter, but for games to run stably in fixed-memory situations the added control and introspection are both extremely important.
It can be a nice trick for your application to be able to respond to low-memory conditions with something other than a random crash. To do this, your new can be a simple proxy to the default new that catches its failures, frees up some stuff, and tries again.
The simplest technique is to reserve a blank block of memory at start-up time for that very purpose. You may also have some cache you can tap into - the idea is the same.
When the first allocation failure kicks in, you still have time to warn your user about the low memory conditions ("I'll be able to survive a little longer, but you may want to save your work and close some other applications"), save your state to disk, switch to survival mode, or whatever else makes sense in your context.
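In standard C++ the hook for this is std::set_new_handler: operator new calls the handler on failure and then retries, so freeing the reserve block inside the handler makes the retry succeed. A minimal sketch (the reserve size is arbitrary):

#include <cstdlib>
#include <new>

static char* reserve = nullptr;  // the blank block set aside at start-up

void lowMemoryHandler() {
    if (reserve) {
        std::free(reserve);      // hand the reserve back so the retry succeeds
        reserve = nullptr;
        // warn the user, save state to disk, enter survival mode, ...
    } else {
        std::set_new_handler(nullptr);  // out of options: let new throw bad_alloc
    }
}

int main() {
    reserve = static_cast<char*>(std::malloc(1 << 20));  // 1 MiB safety net
    std::set_new_handler(lowMemoryHandler);
    // ... rest of the program ...
}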
The most common use case is probably leak checking.
Another use case is when you have specific requirements for memory allocation in your environment which are not satisfied by the standard library you are using, like, for instance, you need to guarantee that memory allocation is lock free in a multi threaded environment.
As many have already stated this is usually done in performance critical applications, or to be able to control memory alignment or track your memory. Games frequently use custom memory managers, especially when targeting specific platforms/consoles.
Here is a pretty good blog post about one way of doing this and some reasoning.
An overloaded new operator also enables programmers to squeeze some extra performance out of their programs. For example, in a class, to speed up the allocation of new nodes, a list of deleted nodes can be maintained so that their memory is reused when new nodes are allocated. In this case, the overloaded delete operator adds nodes to the list of deleted nodes, and the overloaded new operator allocates memory from this list rather than from the heap, to speed up memory allocation. Memory from the heap is used when the list of deleted nodes is empty.
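A sketch of that pattern for a hypothetical Node class (single-threaded; the free list reuses the node's own storage for its links):

#include <cstdlib>
#include <new>

class Node {
    int value = 0;
    Node* next = nullptr;      // doubles as the free-list link after deletion
    static Node* freeList;     // list of deleted nodes
public:
    static void* operator new(std::size_t sz) {
        if (freeList) {        // reuse a previously deleted node
            Node* n = freeList;
            freeList = n->next;
            return n;
        }
        void* p = std::malloc(sz);  // free list empty: go to the heap
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) noexcept {
        Node* n = static_cast<Node*>(p);   // raw storage; destructor already ran
        n->next = freeList;                // push onto the list of deleted nodes
        freeList = n;
    }
};
Node* Node::freeList = nullptr;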