Why we use memory managers? - c++

I have seen that lot of code bases specially server codes have basic (sometimes advanced) memory managers. Is the real purpose of memory manager is to reduce number of malloc calls or mainly for the purpose of memory analysis, corruption check or may be other application centric purposes.
Is the argument of saving malloc calls reasonable enough as malloc in itself is a memory manager. The only performance gain I can reason is when we know that system always ask for same size memory.
Or the reason for having memory manager is that free does not return memory to OS but saves in the list. So over the lifetime of the process, the heap usage of the process may increase if we keep on doing malloc/free because of fragmentation.

mallocis a general purpose allocator - "not slow" is more important than "always fast".
Consider a feature that would be a 10% improvement in many common cases, but might cause significant performance degradation in a few rare cases. An application specific allocator can avoid the rare case and reap the benefits. A general purpose allocator should not.
Besides number of calls to malloc, there are other relevant attributes:
locality of allocations
On current hardware, this easily the most important factor for performance. An application has more knowledge of the access patterns and can optimize the allocations accordingly.
multithreading
A general purpose allocator must allow calls to malloc and free from different threads. This usually requires a lock or similar concurrency handling. If the heap is very busy, this leads to massive contention.
An application that knows that some high-frequency alloc/frees come only from one thread can use its own thread-specific heap, which not only avoids contention for these allocations, but also increases their locality and takes load off the default allocator.
fragmentation
This is still a problem for long running applications on systems with limited physical memory or address space. Fragmentation may require more and more memory or address space from the OS, even without the actual working set increasing. This is a significant problem for applications that need to run uninterrupted.
Last time I looked deeper into allocators (which is probably half a decade past), the consensus was that naive attempts to reduce fragmentation often conflict with the never slow rule.
Again, an application that knows (some of its) allocation patterns can take a lot of load from the default allocator. One very common use case is building a syntax tree or something similar: there are gazillions of small allocations which are never freed individually, only as a whole. Such a pattern can be served efficiently with a very trivial allocator.
resilence and diagnostics
Last not least the diagnostic and self-protection capabilities of the default allocator may not be sufficient for many applications.

Why do we have custom memory managers rather than the built-in ones?
Number one reason is probably that the codebase was originaly written 20-30years ago when the provided one wasn't any good and nobody dares change it.
But otherwise, as you say because the application needs to manage fragmentation, grab memory at startup to ensure that memory will always be available, for security or a bunch of other reasons - most of which could be acheived by correct use of the built-in manager.

C and C++ are designed to be stripped down. They don't do much that is not explicitly asked for, so when a program asks for memory, it gets the minimum possible effort required to deliver that memory.
In other words, if you don't need it, you don't pay for it.
If finer-grained control of the memory is required, that's the domain of the programmer. If the programmer wishes to trade bare metal speed for a system that will provide higher performance on the target hardware in conjunction with the program's often unique goals, better debugging support, or simply likes the look and feel and warm fuzzies that come from using a manager, that is up to them. The programmer either writes something smarter or finds a third party library to do what they want.

You briefly touched on a lot of the different reasons why you would use a memory manager in your question.
Is the real purpose of a memory manager to reduce the number of malloc calls or mainly for the purpose of memory analysis, corruption check or other application centric purposes?
This is the big question. A memory manager in any application can be generic (like malloc) or it can be more specific. The more specialized the memory manager becomes it is likely to be more efficient at the specific task it is supposed to accomplish.
Take this overly-simplified example:
#define MAX_OBJECTS 1000
Foo globalObjects[MAX_OBJECTS];
int main(int argc, char ** argv)
{
void * mallocObjects[MAX_OBJECTS] = {0};
void * customObjects[MAX_OBJECTS] = {0};
for(int i = 0; i < 1000; ++i)
{
mallocObjects[i] = malloc(sizeof(Foo));
customObjects[i] = &globalObjects[i];
}
}
In the above I am pretending that this global object list is our "custom memory allocator." This is just to simplify what I am explaining.
When you allocate with malloc there is no guarantee it is right next to the previous allocation. Malloc is a general purpose allocator and does a good job at that but doesn't necessarily make the most efficient choice for every application.
With a custom allocator you might be able to up front allocate room for 1000 custom objects and since they are a fixed size return the exact amount of memory you need to prevent fragmentation and to efficiently allocate that block.
There is also the difference between memory abstraction and custom memory allocators. STL allocators are arguably an abstraction model and not a custom memory allocator.
Take a look at this link for some more information on custom allocators and why they are useful: gamedev.net link

There are many reasons why we would want to do this and it really depends on the application itself. In fact all the reasons you mentioned are valid.
I once built a very simple memory manager that kept track of shared_ptr allocations in order for me to see what was not being released properly on application end.
I would say stick to your runtime unless you need something that it does not provide.

Memory managers are used basically to manage efficiently your memory reservation. Normally processes have access to a limited amount of memory (4GB in 32bits systems), from this you have to subtract the virtual memory space reserved for the kernel (1GB or 2GB depending on your OS configuration). Thus, virtually the process has access let's say to 3GB of memory that will be used to hold all of its segments (code, data, bss, heap and stack).
Memory managers (malloc for example) try to fulfill the different memory reservation requests issued by the process by requesting new memory pages to the OS (using sbrk or mmap system calls). Every time this happens it implies an extra cost on the program execution since the OS has to look for a suitable memory page to be assigned to the process (Physical memory is limited and all the running processes want to use it), update the process tables (TMP, etc). These operations are time consuming and hit the process execution and performance. Thus, the memory manager normally try to request the needed pages to fulfill the process reservations cleverly. For example it could ask for some more pages to avoid calling more mmap calls in the near future. Additionally, it tries to deal with issues like fragmentation, memory alignment, etc. This basically unloads the process from this responsibility, otherwise everybody writing some program that needs dynamic memory allocation has to perform this manually!
Actually, there are cases where one could be interested in doing the memory management manually. This is the case for embedded or high availability systems which have to run for 24/365. In these cases even if the memory fragmentation is low it could become a problem after very long period of running (1 year for example). So, one of the solutions that are used in this case is to use a memory pool to allocate before hand the memory for the application objects. After-wards each time you need memory for some object you just use the already reserved memory.

For server based or any application that needs to run for long periods of time or indefinitely, the main issue is paged memory fragmentation. After a long series of mallocs / new and free / delete, paged memory can end up with gaps in the pages that waste space and could eventually run out of virtual address space. Microsoft deals with this with it's .NET framework, by occasionally pausing a process to repack paged memory for a process.
To avoid slowdown during repacking of memory in a process, a server type application can use multiple processes for the application, so that during repacking of one process, the other process(es) take more of the load.

Related

Implementing a memory manager in multithreaded C/C++ with dynamically sized memory pool?

Background: I'm developing a multiplatform framework of sorts that will be used as base for both game and util/tool creation. The basic idea is to have a pool of workers, each executing in its own thread. (Furthermore, workers will also be able to spawn at runtime.) Each thread will have it's own memory manager.
I have long thought about creating my own memory management system, and I think this project will be perfect to finally give it a try. I find such a system fitting due to the types of usages of this framework will often require memory allocations in realtime (games and texture edition tools).
Problems:
No generally applicable solution(?) - The framework will be used for both games/visualization (not AAA, but indie/play) and tool/application creation. My understanding is that for game development it is usual (at least for console games) to allocate a big chunk of memory only once in the initialization, and then use this memory internally in the memory manager. But is this technique applicable in a more general application?
In a game you could theoretically know how much memory your scenes and resources will need, but for example, a photo editing application will load resources of all different sizes... So in the latter case a more dynamic memory "chunk size" would be needed? Which leads me to the next problem:
Moving already allocated data and keeping valid pointers - Normally when allocating on the heap, you will acquire a simple pointer to the memory chunk. In a custom memory manager, as far as I understand it, a similar approach is then to return a pointer to somewhere free in the pre-allocated chunk. But what happens if the pre-allocated chunk is too small and needs to be resized or even defragmentated? The data would be needed to be moved around in the memory and the old pointers would be invalid. Is there a way to transparently wrap these pointers in some way, but still use them as normally "outside" the memory management as if they were usual C++ pointers?
Third party libraries - If there is no way to transparently use a custom memory management system for all memory allocation in the application, every third party library I'm linking with, will still use the "old" OS memory allocations internally. I have learned that it is common for libraries to expose functions to set custom allocation functions that the library will use, but it is not guaranteed every library I will use will have this ability.
Questions: Is it possible and feasible to implement a memory manager that can use a dynamically sized memory chunk pool? If so, how would defragmentation and memory resize work, without breaking currently in-use pointers? And finally, how is such a system best implemented to work with third party libraries?
I'm also thankful for any related reading material, papers, articles and whatnot! :-)
As someone who has previously written many memory managers and heap implementations for AAA games for the last few generations of consoles let me tell you its simply not worth it anymore.
Your information is old - back in the gamecube era [circa 2003] we used to do what you said- allocate a large chunk and carve out that chunk manually using custom algorithms tweaked for each game.
Once virtual memory came along (xbox era), games got more complicated [and so made more allocations and became multimthreaded] address fragmentation made this untenable. So we switched to custom allocators to handle certain types of requests only - for instance physical memory, or lock free small block low fragmentation heaps or thread local cache of recently used blocks.
As built in memory managers become better it gets harder to do better than those - certainly in the general case and a close thing for a specific use cases. Doug Lea Allocator [or whatever the mainstream c++ linux compilers come with now] and the latest Windows low fragmentation heaps are really very good, and you'd do far better investing your time elsewhere.
I've got spreadsheets at work measuring all kinds of metrics for a whole load of allocators - all the big name ones and a fair few I've collected over the years. And basically whilst the specialist allocators can win on a few metrics [lowest overhead per alloc, spacial proximity, lowest fragmentation, etc] for overall metrics the mainstream ones are simply the best.
As a user of your library, my personal preferred option is you just allocate memory when you need it. Use operator new/the new operator and I can use the standard C++ mechanisms to replace those and use my custom heap (if I indeed have one), or alternatively I can use platform specific ways of replacing your allocations (e.g. XMemAlloc on Xbox). I don't need tagging [capturing callstacks is far superior which I can do if I want]. Lower down that list comes you giving me an interface that you'll call when you need to allocate memory - this is just a pain for you to implement and I'll probably just pass it onto operator new anyway. The worst thing you can do is 'know best' and create your own custom heaps. If memory allocation performance is a problem, I'd much rather you share the solution the whole game uses than roll your own.
If you're looking to write your own malloc()/free(), etc., you probably should start by checking out the source code for existing systems such as dlmalloc. This is a hard problem, though, for what it's worth. Writing your own malloc library is Hard. Beating existing general purpose malloc libraries will be Even Harder.
And now, here is the correct answer: DON'T IMPLEMENT YET ANOTHER MEMORY MANAGER.
It is incredibly hard to implement a memory manager that does not fail under different kinds of usage patterns and events. You may be able to build a specific manager that works well under YOUR usage patterns, but to write one which works well for MANY users is a full-time job that almost no one has really done well. Worse, it is fantastically easy to implement a memory manager that works great 99% of the time and then 1% of the time crash or suddenly consume most or all available memory on your system due to unexpected heap fragmentation.
I say this as someone who has written multiple memory managers, watched multiple people write their own memory managers, and watched even more people attempt to write memory managers and fail. This problem is deceptively difficult, not because it's hard to write templated allocators and generic types with inheritance and such, but because the other solutions given in this thread tend to fail under corner types of load behavior. Once you start supporting byte alignments (as all real-world allocators must) then heap fragmentation rears its ugly head. Cute heuristics that work great for small test programs, fail miserably when subjected to large, real-world programs.
And once you get it working, someone else will need: cookies to verify against memory stomps; heap usage reporting; memory pools; pools of pools; memory leak tracking and reporting; heap auditing; chunk splitting and coalescing; thread-local storage; lookasides; CPU and process-level page faulting and protection; setting and checking and clearing "free-memory" patterns aka 0xdeadbeef; and whatever else I can't think of off the top of my head.
Writing yet another memory manager falls squarely under the heading of Premature Optimization. Since there are multiple free, good, memory managers with thousands of hours of development and testing behind them, you have to justify spending the cost of your own time in such a way that the result would provide some sort of measurable improvement over what other people have done, and you can use, for free.
If you are SURE you want to implement your own memory manager (and hopefully you are NOT sure after reading this message), read through the dlmalloc sources in detail, then read through the tcmalloc sources in detail as well, THEN make sure you understand the performance trade-offs in implementing a thread-safe versus a thread-unsafe memory manager, and why the naive implementations tend to give poor performance results.
Prepare more than one solution and let the user of the framework adopt any particular one. Policy classes to the generic allocator you develop would do this nicely.
A nice way to get around this is to wrap up pointers in a class with overloaded * operator. Make the internal data of that class only an index to the memory pool. Now, you can just change the index quickly after a background thread copies the data over.
Most good C++ libraries support allocators and you should implement one. You can also overload the global new so your version gets used. And keep in mind that you generally won't need to think about a library allocating or deallocating a large amount of data, which is generally a responsibility of client code.

Concurrent dynamic memory management on an AVR32

I am developing the software system for an embedded system of a student satellite. Our code is a mix of C/C++, running on an an AT32UC3A3256S 32-bit AVR microcontroller. We are running the FreeRTOS operating system on the hardware which is working fine. We also have a need for a somewhat specialized memory management scheme due to the physical memory layout, and concept of operations for the mission.
I have been attempting to use a dynamic memory implementation called dlmalloc, largely due to the availability of mspaces, which allows dynamic memory allocation into contained and tracked sections. I have some code that wraps around dlmalloc, in order to create mspaces in certain places in memory, and cause allocations to be tied to these mspaces depending on the FreeRTOS task making the request. The end product is a memory management system that amount of memory a given task has allocated, and if it has gone over it's imposed limit, will reset the task and free it's memory.
I have created a test task that essentially is a big memory leak, continuously allocating memory without freeing it. The memory management system in place should periodically reset this task as it overflows its limit, freeing all memory that would otherwise be leaked. This works perfectly for a single task running, however fails in very odd ways if two similar copies of this task run simultaneously, leading me to believe that the memory allocation is not thread safe.
I have surrounded every call to memory allocation routines with FreeRTOS routines that ensure that only the task allocating memory will run for the duration of the memory request. To me this seems like it should provide the thread-safeness needed, but obviously something else is wrong.
Does anybody have any ideas on what I might be missing to make this system thread-safe, on how to port dlmalloc to the hardware I am using, on any other concurrent memory allocators I could possibly use, or any advice at all? I can provide much more information if necessary but not did want to bloat the original post more than I already have.

Dealing with fragmentation in a memory pool?

Suppose I have a memory pool object with a constructor that takes a pointer to a large chunk of memory ptr and size N. If I do many random allocations and deallocations of various sizes I can get the memory in such a state that I cannot allocate an M byte object contiguously in memory even though there may be a lot free! At the same time, I can't compact the memory because that would cause a dangling pointer on the consumers. How does one resolve fragmentation in this case?
I wanted to add my 2 cents only because no one else pointed out that from your description it sounds like you are implementing a standard heap allocator (i.e what all of us already use every time when we call malloc() or operator new).
A heap is exactly such an object, that goes to virtual memory manager and asks for large chunk of memory (what you call "a pool"). Then it has all kinds of different algorithms for dealing with most efficient way of allocating various size chunks and freeing them. Furthermore, many people have modified and optimized these algorithms over the years. For long time Windows came with an option called low-fragmentation heap (LFH) which you used to have to enable manually. Starting with Vista LFH is used for all heaps by default.
Heaps are not perfect and they can definitely bog down performance when not used properly. Since OS vendors can't possibly anticipate every scenario in which you will use a heap, their heap managers have to be optimized for the "average" use. But if you have a requirement which is similar to the requirements for a regular heap (i.e. many objects, different size....) you should consider just using a heap and not reinventing it because chances are your implementation will be inferior to what OS already provides for you.
With memory allocation, the only time you can gain performance by not simply using the heap is by giving up some other aspect (allocation overhead, allocation lifetime....) which is not important to your specific application.
For example, in our application we had a requirement for many allocations of less than 1KB but these allocations were used only for very short periods of time (milliseconds). To optimize the app, I used Boost Pool library but extended it so that my "allocator" actually contained a collection of boost pool objects, each responsible for allocating one specific size from 16 bytes up to 1024 (in steps of 4). This provided almost free (O(1) complexity) allocation/free of these objects but the catch is that a) memory usage is always large and never goes down even if we don't have a single object allocated, b) Boost Pool never frees the memory it uses (at least in the mode we are using it in) so we only use this for objects which don't stick around very long.
So which aspect(s) of normal memory allocation are you willing to give up in your app?
Depending on the system there are a couple of ways to do it.
Try to avoid fragmentation in the first place, if you allocate blocks in powers of 2 you have less a chance of causing this kind of fragmentation. There are a couple of other ways around it but if you ever reach this state then you just OOM at that point because there are no delicate ways of handling it other than killing the process that asked for memory, blocking until you can allocate memory, or returning NULL as your allocation area.
Another way is to pass pointers to pointers of your data(ex: int **). Then you can rearrange memory beneath the program (thread safe I hope) and compact the allocations so that you can allocate new blocks and still keep the data from old blocks (once the system gets to this state though that becomes a heavy overhead but should seldom be done).
There are also ways of "binning" memory so that you have contiguous pages for instance dedicate 1 page only to allocations of 512 and less, another for 1024 and less, etc... This makes it easier to make decisions about which bin to use and in the worst case you split from the next highest bin or merge from a lower bin which reduces the chance of fragmenting across multiple pages.
Implementing object pools for the objects that you frequently allocate will drive fragmentation down considerably without the need to change your memory allocator.
It would be helpful to know more exactly what you are actually trying to do, because there are many ways to deal with this.
But, the first question is: is this actually happening, or is it a theoretical concern?
One thing to keep in mind is you normally have a lot more virtual memory address space available than physical memory, so even when physical memory is fragmented, there is still plenty of contiguous virtual memory. (Of course, the physical memory is discontiguous underneath but your code doesn't see that.)
I think there is sometimes unwarranted fear of memory fragmentation, and as a result people write a custom memory allocator (or worse, they concoct a scheme with handles and moveable memory and compaction). I think these are rarely needed in practice, and it can sometimes improve performance to throw this out and go back to using malloc.
write the pool to operate as a list of allocations, you can then extended and destroyed as needed. this can reduce fragmentation.
and/or implement allocation transfer (or move) support so you can compact active allocations. the object/holder may need to assist you, since the pool may not necessarily know how to transfer types itself. if the pool is used with a collection type, then it is far easier to accomplish compacting/transfers.

Which memory allocation algorithm suits best for performance and time critical c++ applications?

I ask this question to determine which memory allocation algorithm gives better results with performance critical applications, like game engines, or embedded applications. Results are actually depends percentage of memory fragmented and time-determinism of memory request.
There are several algorithms in the text books (e.g. Buddy memory allocation), but also there are others like TLSF. Therefore, regarding memory allocation algorithms available, which one of them is fastest and cause less fragmentation. BTW, Garbage collectors should be not included.
Please also, note that this question is not about profiling, it just aims to find out optimum algorithm for given requirements.
It all depends on the application. Server applications which can clear out all memory relating to a particular request at defined moments will have a different memory access pattern than video games, for instance.
If there was one memory allocation algorithm that was always best for performance and fragmentation, wouldn't the people implementing malloc and new always choose that algorithm?
Nowadays, it's usually best to assume that the people who wrote your operating system and runtime libraries weren't brain dead; and unless you have some unusual memory access pattern don't try to beat them.
Instead, try to reduce the number of allocations (or reallocations) you make. For instance, I often use a std::vector, but if I know ahead of time how many elements it will have, I can reserve that all in one go. This is much more efficient than letting it grow "naturally" through several calls to push_back().
Many people coming from languages where new just means "gimme an object" will allocate things for no good reason. If you don't have to put it on the heap, don't call new.
As for fragmentation: it still depends. Unfortunately I can't find the link now, but I remember a blog post from somebody at Microsoft who had worked on a C++ server application that suffered from memory fragmentation. The team solved the problem by allocating memory from two regions. Memory for all requests would come from region A until it was full (requests would free memory as normal). When region A was full, all memory would be allocated from region B. By the time region B was full, region A was completely empty again. This solved their fragmentation problem.
Will it solve yours? I have no idea. Are you working on a project which services several independent requests? Are you working on a game?
As for determinism: it still depends. What is your deadline? What happens when you miss the deadline (astronauts lost in space? the music being played back starts to sound like garbage?)? There are real time allocators, but remember: "real time" means "makes a promise about meeting a deadline," not necessarily "fast."
I did just come across a post describing various things Facebook has done to both speed up and reduce fragmentation in jemalloc. You may find that discussion interesting.
Barış:
Your question is very general, but here's my answer/guidance:
I don't know about game engines, but for embedded and real time applications, The general goals of an allocation algorithm are:
1- Bounded execution time: You have to know in advance the worst case allocation time so you can plan your real time tasks accordingly.
2- Fast execution: Well, the faster the better, obviously
3- Always allocate: Especially for real-time, security critical applications, all requests must be satisfied. If you request some memory space and get a null pointer: trouble!
4- Reduce fragmentation: Although this depends on the algorithm used, generally, less fragmented allocations provide better performance, due to a number of reasons, including caching effects.
In most critical systems, you are not allowed to dynamically allocate any memory to begin with. You analyze your requirements and determine your maximum memory use and allocate a large chunk of memory as soon as your application starts. If you can't, then the application does not even start, if it does start, no new memory blocks are allocated during execution.
If speed is a concern, I'd recommend following a similar approach. You can implement a memory pool which manages your memory. The pool could initialize a "sufficient" block of memory in the start of your application and serve your memory requests from this block. If you require more memory, the pool can do another -probably large- allocation (in anticipation of more memory requests), and your application can start using this newly allocated memory. There are various memory pooling schemes around as well, and managing these pools is another whole topic.
As for some examples: VxWorks RTOS used to employ a first-fit allocation algorithm where the algorithm analyzed a linked list to find a big enough free block. In VxWorks 6, they're using a best-fit algorithm, where the free space is kept in a tree and allocations traverse the tree for a big enough free block. There's a white paper titled Memory Allocation in VxWorks 6.0, by Zoltan Laszlo, which you can find by Googling, that has more detail.
Going back to your question about speed/fragmentation: It really depends on your application. Things to consider are:
Are you going to make lots of very small allocations, or relatively larger ones?
Will the allocations come in bursts, or spread equally throughout the application?
What is the lifetime of the allocations?
If you're asking this question because you're going to implement your own allocator, you should probably design it in such a way that you can change the underlying allocation/deallocation algorithm, because if the speed/fragmentation is really that critical in your application, you're going to want to experiment with different allocators. If I were to recommend something without knowing any of your requirements, I'd start with TLSF, since it has good overall characteristics.
As other already wrote, there is no "optimum algorithm" for each possible application. It was already proven that for any possible algorithm you can find an allocation sequence which will cause a fragmentation.
Below I write a few hints from my game development experience:
Avoid allocations if you can
A common practices in the game development field was (and to certain extent still is) to solve the dynamic memory allocation performance issues by avoiding the memory allocations like a plague. It is quite often possible to use stack based memory instead - even for dynamic arrays you can often come with an estimate which will cover 99 % of cases for you and you need to allocate only when you are over this boundary. Another commonly used approach is "preallocation": estimate how much memory you will need in some function or for some object, create a kind of small and simplistic "local heap" you allocate up front and perform the individual allocations from this heap only.
Memory allocator libraries
Another option is to use some of the memory allocation libraries - they are usually created by experts in the field to fit some special requirements, and if you have similar requiremens, they may fit your requirements.
Multithreading
There is one particular case in which you will find the "default" OS/CRT allocator performs badly, and that is multithreading. If you are targeting Windows, by aware both OS and CRT allocators provided by Microsoft (including the otherwise excellent Low Fragmentation Heap) are currently blocking. If you want to perform significant threading, you need either to reduce the allocation as much as possible, or to use some of the alternatives. See Can multithreading speed up memory allocation?
The best practice is - use whatever you can use to make the thing done in time (in your case - default allocator). If the whole thing is very complex - write tests and samples that will emulate parts of the whole thing. Then, run performance tests and benchmarks to find bottle necks (probably they will nothing to do with memory allocation :).
From this point you will see what exactly slowdowns your code and why. Only based on such precise knowledge you can ever optimize something and choose one algorithm over another. Without tests its just a waste of time since you can't even measure how much your optimization will speedup your app (in fact such "premature" optimizations can really slowdown it).
Memory allocation is a very complex thing and it really depends on many factors. For example, such allocator is simple and damn fast but can be used only in limited number of situations:
char pool[MAX_MEMORY_REQUIRED_TO_RENDER_FRAME];
char *poolHead = pool;
void *alloc(size_t sz) { char *p = poolHead; poolHead += sz; return p; }
void free() { poolHead = pool; }
So there is no "the best algorithm ever".
One constraint that's worth mentioning, which has not been mentioned yet, is multi-threading: Standard allocators must be implemented to support several threads, all allocating/deallocating concurrently, and passing objects from one thread to another so that it gets deallocated by a different thread.
As you may have guessed from that description, it is a tricky task to implement an allocator that handles all of this well. And it does cost performance as it is impossible to satisfy all these constrains without inter-thread communication (= use of atomic variables and locks) which is quite costly.
As such, if you can avoid concurrency in your allocations, you stand a good chance to implement your own allocator that significantly outperforms the standard allocators: I once did this myself, and it saved me roughly 250 CPU cycles per allocation with a fairly simple allocator that's based on a number of fixed sized memory pools for small objects, stacking free objects with an intrusive linked list.
Of course, avoiding concurrency is likely a no-go for you, but if you don't use it anyway, exploiting that fact might be something worth thinking about.

Any reason to overload global new and delete?

Unless you're programming parts of an OS or an embedded system are there any reasons to do so? I can imagine that for some particular classes that are created and destroyed frequently overloading memory management functions or introducing a pool of objects might lower the overhead, but doing these things globally?
Addition
I've just found a bug in an overloaded delete function - memory wasn't always freed. And that was in a not-so memory critical application. Also, disabling these overloads decreases performance by ~0.5% only.
We overload the global new and delete operators where I work for many reasons:
pooling all small allocations -- decreases overhead, decreases fragmentation, can increase performance for small-alloc-heavy apps
framing allocations with a known lifetime -- ignore all the frees until the very end of this period, then free all of them together (admittedly we do this more with local operator overloads than global)
alignment adjustment -- to cacheline boundaries, etc
alloc fill -- helping to expose usage of uninitialized variables
free fill -- helping to expose usage of previously deleted memory
delayed free -- increasing the effectiveness of free fill, occasionally increasing performance
sentinels or fenceposts -- helping to expose buffer overruns, underruns, and the occasional wild pointer
redirecting allocations -- to account for NUMA, special memory areas, or even to keep separate systems separate in memory (for e.g. embedded scripting languages or DSLs)
garbage collection or cleanup -- again useful for those embedded scripting languages
heap verification -- you can walk through the heap data structure every N allocs/frees to make sure everything looks ok
accounting, including leak tracking and usage snapshots/statistics (stacks, allocation ages, etc)
The idea of new/delete accounting is really flexible and powerful: you can, for example, record the entire callstack for the active thread whenever an alloc occurs, and aggregate statistics about that. You could ship the stack info over the network if you don't have space to keep it locally for whatever reason. The types of info you can gather here are only limited by your imagination (and performance, of course).
We use global overloads because it's convenient to hang lots of common debugging functionality there, as well as make sweeping improvements across the entire app, based on the statistics we gather from those same overloads.
We still do use custom allocators for individual types too; in many cases the speedup or capabilities you can get by providing custom allocators for e.g. a single point-of-use of an STL data structure far exceeds the general speedup you can get from the global overloads.
Take a look at some of the allocators and debugging systems that are out there for C/C++ and you'll rapidly come up with these and other ideas:
valgrind
electricfence
dmalloc
dlmalloc
Application Verifier
Insure++
BoundsChecker
...and many others... (the gamedev industry is a great place to look)
(One old but seminal book is Writing Solid Code, which discusses many of the reasons you might want to provide custom allocators in C, most of which are still very relevant.)
Obviously if you can use any of these fine tools you will want to do so rather than rolling your own.
There are situations in which it is faster, easier, less of a business/legal hassle, nothing's available for your platform yet, or just more instructive: dig in and write a global overload.
The most common reason to overload new and delete are simply to check for memory leaks, and memory usage stats. Note that "memory leak" is usually generalized to memory errors. You can check for things such as double deletes and buffer overruns.
The uses after that are usually memory-allocation schemes, such as garbage collection, and pooling.
All other cases are just specific things, mentioned in other answers (logging to disk, kernel use).
In addition to the other important uses mentioned here, like memory tagging, it's also the only way to force all allocations in your app to go through fixed-block allocation, which has enormous implications for performance and fragmentation.
For example, you may have a series of memory pools with fixed block sizes. Overriding global new lets you direct all 61-byte allocations to, say, the pool with 64-byte blocks, all 768-1024 byte allocs to the the 1024b-block pool, all those above that to the 2048 byte block pool, and anything larger than 8kb to the general ragged heap.
Because fixed block allocators are much faster and less prone to fragmentation than allocating willy-nilly from the heap, this lets you force even crappy 3d party code to allocate from your pools and not poop all over the address space.
This is done often in systems which are time- and space-critical, such as games. 280Z28, Meeh, and Dan Olson have described why.
UnrealEngine3 overloads global new and delete as part of its core memory management system. There are multiple allocators that provide different features (profiling, performance, etc.) and they need all allocations to go through it.
Edit: For my own code, I would only ever do it as a last resort. And by that I mean I would almost positively never use it. But my personal projects are obviously much smaller/very different requirements.
Some realtime systems overload them to avoid them being used after init..
Overloading new & delete makes it possible to add a tag to your memory allocations. I tag allocations per system or control or by middleware. I can view, at runtime, how much each uses. Maybe I want to see the usage of a parser separated from the UI or how much a piece of middleware is really using!
You can also use it to put guard bands around the allocated memory. If/when your app crashes you can take a look at the address. If you see the contents as "0xABCDABCD" (or whatever you choose as guard) you are accessing memory you don't own.
Perhaps after calling delete you can fill this space with a similarly recognizable pattern.
I believe VisualStudio does something similar in debug. Doesn't it fill uninitialized memory with 0xCDCDCDCD?
Finally, if you have fragmentation issues you could use it to redirect to a block allocator? I am not sure how often this is really a problem.
You need to overload them when the call to new and delete doesn't work in your environment.
For example, in kernel programming, the default new and delete don't work as they rely on user mode library to allocate memory.
From a practical standpoint it may just be better to override malloc on a system library level, since operator new will probably be calling it anyway.
On linux, you can put your own version of malloc in place of the system one, as in this example here:
http://developers.sun.com/solaris/articles/lib_interposers.html
In that article, they are trying to collect performance statistics. But you may also detect memory leaks if you also override free.
Since you are doing this in a shared library with LD_PRELOAD, you don't even need to recompile your application.
I've seen it done in a system that for 'security'* reasons was required to write over all memory it used on de-allocation. The approach was to allocate an extra few bytes at the start of each block of memory which would contain the size of the overall block which would then be overwritten with zeros on delete.
This had a number of problems as you can probably imagine but it did work (mostly) and saved the team from reviewing every single memory allocation in a reasonably large, existing application.
Certainly not saying that it is a good use but it is probably one of the more imaginative ones out there...
* sadly it wasn't so much about actual security as the appearance of security...
Photoshop plugins written in C++ should override operator new so that they obtain memory via Photoshop.
I've done it with memory mapped files so that data written to the memory is automatically also saved to disk.
It's also used to return memory at a specific physical address if you have memory mapped IO devices, or sometimes if you need to allocate a certain block of contiguous memory.
But 99% of the time it's done as a debugging feature to log how often, where, when memory is being allocated and released.
It's actually pretty common for games to allocate one huge chunk of memory from the system and then provide custom allocators via overloaded new and delete. One big reason is that consoles have a fixed memory size, making both leaks and fragmentation large problems.
Usually (at least on a closed platform) the default heap operations come with a lack of control and a lack of introspection. For many applications this doesn't matter, but for games to run stably in fixed-memory situations the added control and introspection are both extremely important.
It can be a nice trick for your application to be able to respond to low memory conditions by something else than a random crash. To do this your new can be a simple proxy to the default new that catches its failures, frees up some stuff and tries again.
The simplest technique is to reserve a blank block of memory at start-up time for that very purpose. You may also have some cache you can tap into - the idea is the same.
When the first allocation failure kicks in, you still have time to warn your user about the low memory conditions ("I'll be able to survive a little longer, but you may want to save your work and close some other applications"), save your state to disk, switch to survival mode, or whatever else makes sense in your context.
The most common use case is probably leak checking.
Another use case is when you have specific requirements for memory allocation in your environment which are not satisfied by the standard library you are using, like, for instance, you need to guarantee that memory allocation is lock free in a multi threaded environment.
As many have already stated this is usually done in performance critical applications, or to be able to control memory alignment or track your memory. Games frequently use custom memory managers, especially when targeting specific platforms/consoles.
Here is a pretty good blog post about one way of doing this and some reasoning.
Overloaded new operator also enables programmers to squeeze some extra performance out of their programs. For example, In a class, to speed up the allocation of new nodes, a list of deleted nodes is maintained so that their memory can be reused when new nodes are allocated.In this case, the overloaded delete operator will add nodes to the list of deleted nodes and the overloaded new operator will allocate memory from this list rather than from the heap to speedup memory allocation. Memory from the heap can be used when the list of deleted nodes is empty.