Dealing with fragmentation in a memory pool? - c++

Suppose I have a memory pool object with a constructor that takes a pointer to a large chunk of memory ptr and size N. If I do many random allocations and deallocations of various sizes I can get the memory in such a state that I cannot allocate an M byte object contiguously in memory even though there may be a lot free! At the same time, I can't compact the memory because that would cause a dangling pointer on the consumers. How does one resolve fragmentation in this case?

I wanted to add my 2 cents only because no one else pointed out that from your description it sounds like you are implementing a standard heap allocator (i.e what all of us already use every time when we call malloc() or operator new).
A heap is exactly such an object, that goes to virtual memory manager and asks for large chunk of memory (what you call "a pool"). Then it has all kinds of different algorithms for dealing with most efficient way of allocating various size chunks and freeing them. Furthermore, many people have modified and optimized these algorithms over the years. For long time Windows came with an option called low-fragmentation heap (LFH) which you used to have to enable manually. Starting with Vista LFH is used for all heaps by default.
Heaps are not perfect and they can definitely bog down performance when not used properly. Since OS vendors can't possibly anticipate every scenario in which you will use a heap, their heap managers have to be optimized for the "average" use. But if you have a requirement which is similar to the requirements for a regular heap (i.e. many objects, different size....) you should consider just using a heap and not reinventing it because chances are your implementation will be inferior to what OS already provides for you.
With memory allocation, the only time you can gain performance by not simply using the heap is by giving up some other aspect (allocation overhead, allocation lifetime....) which is not important to your specific application.
For example, in our application we had a requirement for many allocations of less than 1KB but these allocations were used only for very short periods of time (milliseconds). To optimize the app, I used Boost Pool library but extended it so that my "allocator" actually contained a collection of boost pool objects, each responsible for allocating one specific size from 16 bytes up to 1024 (in steps of 4). This provided almost free (O(1) complexity) allocation/free of these objects but the catch is that a) memory usage is always large and never goes down even if we don't have a single object allocated, b) Boost Pool never frees the memory it uses (at least in the mode we are using it in) so we only use this for objects which don't stick around very long.
So which aspect(s) of normal memory allocation are you willing to give up in your app?

Depending on the system there are a couple of ways to do it.
Try to avoid fragmentation in the first place, if you allocate blocks in powers of 2 you have less a chance of causing this kind of fragmentation. There are a couple of other ways around it but if you ever reach this state then you just OOM at that point because there are no delicate ways of handling it other than killing the process that asked for memory, blocking until you can allocate memory, or returning NULL as your allocation area.
Another way is to pass pointers to pointers of your data(ex: int **). Then you can rearrange memory beneath the program (thread safe I hope) and compact the allocations so that you can allocate new blocks and still keep the data from old blocks (once the system gets to this state though that becomes a heavy overhead but should seldom be done).
There are also ways of "binning" memory so that you have contiguous pages for instance dedicate 1 page only to allocations of 512 and less, another for 1024 and less, etc... This makes it easier to make decisions about which bin to use and in the worst case you split from the next highest bin or merge from a lower bin which reduces the chance of fragmenting across multiple pages.

Implementing object pools for the objects that you frequently allocate will drive fragmentation down considerably without the need to change your memory allocator.

It would be helpful to know more exactly what you are actually trying to do, because there are many ways to deal with this.
But, the first question is: is this actually happening, or is it a theoretical concern?
One thing to keep in mind is you normally have a lot more virtual memory address space available than physical memory, so even when physical memory is fragmented, there is still plenty of contiguous virtual memory. (Of course, the physical memory is discontiguous underneath but your code doesn't see that.)
I think there is sometimes unwarranted fear of memory fragmentation, and as a result people write a custom memory allocator (or worse, they concoct a scheme with handles and moveable memory and compaction). I think these are rarely needed in practice, and it can sometimes improve performance to throw this out and go back to using malloc.

write the pool to operate as a list of allocations, you can then extended and destroyed as needed. this can reduce fragmentation.
and/or implement allocation transfer (or move) support so you can compact active allocations. the object/holder may need to assist you, since the pool may not necessarily know how to transfer types itself. if the pool is used with a collection type, then it is far easier to accomplish compacting/transfers.

Related

Understanding Memory Pools

To my understanding, a memory pool is a block, or multiple blocks of memory allocate on the stack before runtime.
By contrast, to my understanding, dynamic memory is requested from the operating system and then allocated on the heap during run time.
// EDIT //
Memory pools are evidently not necessarily allocated on the stack ie. a memory pool can be used with dynamic memory.
Non dynamic memory is evidently also not necessarily allocated on the stack, as per the answer to this question.
The topics of 'dynamic vs. static memory' and 'memory pools' are thus not really related although the answer is still relevant.
From what I can tell, the purpose of a memory pool is to provide manual management of RAM, where the memory must be tracked and reused by the programmer.
This is theoretically advantageous for performance for a number of reasons:
Dynamic memory becomes fragmented over time
The CPU can parse static blocks of memory faster than dynamic blocks
When the programmer has control over memory, they can choose to free and rebuild data when it is best to do so, according the the specific program.
4. When multithreading, separate pools allow separate threads to operate independently without waiting for the shared heap (Davislor)
Is my understanding of memory pools correct? If so, why does it seem like memory pools are not used very often?
It seems this question is thwart with XY problem and premature optimisation.
You should focus on writing legible code, then using a profiler to perform optimisations if necessary.
Is my understanding of memory pools correct?
Not quite.
... on the stack ...
... on the heap ...
Storage duration is orthogonal to the concept of pools; pools can be allocated to have any of the four storage durations (they are: static, thread, automatic and dynamic storage duration).
The C++ standard doesn't require that any of these go into a stack or a heap; it might be useful to think of all of them as though they go into the same place... after all, they all (commonly) go onto silicon chips!
... allocate ... before runtime ...
What matters is that the allocation of multiple objects occurs before (or at least less often than) those objects are first used; this saves having to allocate each object separately. I assume this is what you meant by "before runtime". When choosing the size of the allocation, the closer you get to the total number of objects required at any given time the less waste from excessive allocation and the less waste from excessive resizing.
If your OS isn't prehistoric, however, the advantages of pools will quickly diminish. You'd probably see this if you used a profiler before and after conducting your optimisation!
Dynamic memory becomes fragmented over time
This may be true for a naive operating system such as Windows 1.0. However, in this day and age objects with allocated storage duration are commonly stored in virtual memory, which periodically gets written to, and read back from disk (this is called paging). As a consequence, fragmented memory can be defragmented and objects, functions and methods that are more commonly used might even end up being united into common pages.
That is, paging forms an implicit pool (and cache prediction) for you!
The CPU can parse static blocks of memory faster than dynamic blocks
While objects allocated with static storage duration commonly are located on the stack, that's not mandated by the C++ standard. It's entirely possible that a C++ implementation may exist where-by static blocks of memory are allocated on the heap, instead.
A cache hit on a dynamic object will be just as fast as a cache hit on a static object. It just so happens that the stack is commonly kept in cache; you should try programming without the stack some time, and you might find that the cache has more room for the heap!
BEFORE you optimise you should ALWAYS use a profiler to measure the most significant bottleneck! Then you should perform the optimisation, and then run the profiler again to make sure the optimisation was a success!
This is not a machine-independent process! You need to optimise per-implementation! An optimisation for one implementation is likely a pessimisation for another.
If so, why does it seem like memory pools are not used very often?
The virtual memory abstraction described above, in conjunction with eliminating guess-work using cache profilers virtually eliminates the usefulness of pools in all but the least-informed (i.e. use a profiler) scenarios.
A customized allocator can help performance since the default allocator is optimized for a specific use case, which is infrequently allocating large chunks of memory.
But let's say for example in a simulator or game, you may have a lot of stuff happening in one frame, allocating and freeing memory very frequently. In this case the default allocator is not as good.
A simple solution can be allocating a block of memory for all the throwaway stuff happening during a frame. This block of memory can be overwritten over and over again, and the deletion can be deferred to a later time. e.g: end of a game level or whatever.
Memory pools are used to implement custom allocators.
One commonly used is a linear allocator. It only keeps a pointer seperating allocated/free memory. Allocating with it is just a matter of incrementing the pointer by the N bytes requested, and returning it's previous value. And deallocation is done by resetting the pointer to the start of the pool.

Why we use memory managers?

I have seen that lot of code bases specially server codes have basic (sometimes advanced) memory managers. Is the real purpose of memory manager is to reduce number of malloc calls or mainly for the purpose of memory analysis, corruption check or may be other application centric purposes.
Is the argument of saving malloc calls reasonable enough as malloc in itself is a memory manager. The only performance gain I can reason is when we know that system always ask for same size memory.
Or the reason for having memory manager is that free does not return memory to OS but saves in the list. So over the lifetime of the process, the heap usage of the process may increase if we keep on doing malloc/free because of fragmentation.
mallocis a general purpose allocator - "not slow" is more important than "always fast".
Consider a feature that would be a 10% improvement in many common cases, but might cause significant performance degradation in a few rare cases. An application specific allocator can avoid the rare case and reap the benefits. A general purpose allocator should not.
Besides number of calls to malloc, there are other relevant attributes:
locality of allocations
On current hardware, this easily the most important factor for performance. An application has more knowledge of the access patterns and can optimize the allocations accordingly.
multithreading
A general purpose allocator must allow calls to malloc and free from different threads. This usually requires a lock or similar concurrency handling. If the heap is very busy, this leads to massive contention.
An application that knows that some high-frequency alloc/frees come only from one thread can use its own thread-specific heap, which not only avoids contention for these allocations, but also increases their locality and takes load off the default allocator.
fragmentation
This is still a problem for long running applications on systems with limited physical memory or address space. Fragmentation may require more and more memory or address space from the OS, even without the actual working set increasing. This is a significant problem for applications that need to run uninterrupted.
Last time I looked deeper into allocators (which is probably half a decade past), the consensus was that naive attempts to reduce fragmentation often conflict with the never slow rule.
Again, an application that knows (some of its) allocation patterns can take a lot of load from the default allocator. One very common use case is building a syntax tree or something similar: there are gazillions of small allocations which are never freed individually, only as a whole. Such a pattern can be served efficiently with a very trivial allocator.
resilence and diagnostics
Last not least the diagnostic and self-protection capabilities of the default allocator may not be sufficient for many applications.
Why do we have custom memory managers rather than the built-in ones?
Number one reason is probably that the codebase was originaly written 20-30years ago when the provided one wasn't any good and nobody dares change it.
But otherwise, as you say because the application needs to manage fragmentation, grab memory at startup to ensure that memory will always be available, for security or a bunch of other reasons - most of which could be acheived by correct use of the built-in manager.
C and C++ are designed to be stripped down. They don't do much that is not explicitly asked for, so when a program asks for memory, it gets the minimum possible effort required to deliver that memory.
In other words, if you don't need it, you don't pay for it.
If finer-grained control of the memory is required, that's the domain of the programmer. If the programmer wishes to trade bare metal speed for a system that will provide higher performance on the target hardware in conjunction with the program's often unique goals, better debugging support, or simply likes the look and feel and warm fuzzies that come from using a manager, that is up to them. The programmer either writes something smarter or finds a third party library to do what they want.
You briefly touched on a lot of the different reasons why you would use a memory manager in your question.
Is the real purpose of a memory manager to reduce the number of malloc calls or mainly for the purpose of memory analysis, corruption check or other application centric purposes?
This is the big question. A memory manager in any application can be generic (like malloc) or it can be more specific. The more specialized the memory manager becomes it is likely to be more efficient at the specific task it is supposed to accomplish.
Take this overly-simplified example:
#define MAX_OBJECTS 1000
Foo globalObjects[MAX_OBJECTS];
int main(int argc, char ** argv)
{
void * mallocObjects[MAX_OBJECTS] = {0};
void * customObjects[MAX_OBJECTS] = {0};
for(int i = 0; i < 1000; ++i)
{
mallocObjects[i] = malloc(sizeof(Foo));
customObjects[i] = &globalObjects[i];
}
}
In the above I am pretending that this global object list is our "custom memory allocator." This is just to simplify what I am explaining.
When you allocate with malloc there is no guarantee it is right next to the previous allocation. Malloc is a general purpose allocator and does a good job at that but doesn't necessarily make the most efficient choice for every application.
With a custom allocator you might be able to up front allocate room for 1000 custom objects and since they are a fixed size return the exact amount of memory you need to prevent fragmentation and to efficiently allocate that block.
There is also the difference between memory abstraction and custom memory allocators. STL allocators are arguably an abstraction model and not a custom memory allocator.
Take a look at this link for some more information on custom allocators and why they are useful: gamedev.net link
There are many reasons why we would want to do this and it really depends on the application itself. In fact all the reasons you mentioned are valid.
I once built a very simple memory manager that kept track of shared_ptr allocations in order for me to see what was not being released properly on application end.
I would say stick to your runtime unless you need something that it does not provide.
Memory managers are used basically to manage efficiently your memory reservation. Normally processes have access to a limited amount of memory (4GB in 32bits systems), from this you have to subtract the virtual memory space reserved for the kernel (1GB or 2GB depending on your OS configuration). Thus, virtually the process has access let's say to 3GB of memory that will be used to hold all of its segments (code, data, bss, heap and stack).
Memory managers (malloc for example) try to fulfill the different memory reservation requests issued by the process by requesting new memory pages to the OS (using sbrk or mmap system calls). Every time this happens it implies an extra cost on the program execution since the OS has to look for a suitable memory page to be assigned to the process (Physical memory is limited and all the running processes want to use it), update the process tables (TMP, etc). These operations are time consuming and hit the process execution and performance. Thus, the memory manager normally try to request the needed pages to fulfill the process reservations cleverly. For example it could ask for some more pages to avoid calling more mmap calls in the near future. Additionally, it tries to deal with issues like fragmentation, memory alignment, etc. This basically unloads the process from this responsibility, otherwise everybody writing some program that needs dynamic memory allocation has to perform this manually!
Actually, there are cases where one could be interested in doing the memory management manually. This is the case for embedded or high availability systems which have to run for 24/365. In these cases even if the memory fragmentation is low it could become a problem after very long period of running (1 year for example). So, one of the solutions that are used in this case is to use a memory pool to allocate before hand the memory for the application objects. After-wards each time you need memory for some object you just use the already reserved memory.
For server based or any application that needs to run for long periods of time or indefinitely, the main issue is paged memory fragmentation. After a long series of mallocs / new and free / delete, paged memory can end up with gaps in the pages that waste space and could eventually run out of virtual address space. Microsoft deals with this with it's .NET framework, by occasionally pausing a process to repack paged memory for a process.
To avoid slowdown during repacking of memory in a process, a server type application can use multiple processes for the application, so that during repacking of one process, the other process(es) take more of the load.

Heap optimized for (but not limited to) single-threaded usage

I use a custom heap implementation in one of my projects. It consists of two major parts:
Fixed size-block heap. I.e. a heap that allocates blocks of a specific size only. It allocates larger memory blocks (either virtual memory pages or from another heap), and then divides them into atomic allocation units.
It performs allocation/freeing fast (in O(1)) and there's no memory usage overhead, not taking into account things imposed by the external heap.
Global general-purpose heap. It consists of buckets of the above (fixed-size) heaps. WRT the requested allocation size it chooses the appropriate bucket, and performs the allocation via it.
Since the whole application is (heavily) multi-threaded - the global heap locks the appropriate bucket during its operation.
Note: in contrast to the traditional heaps, this heap requires the allocation size not only for the allocation, but also for freeing. This allows to identify the appropriate bucket without searches or extra memory overhead (such as saving the block size preceding the allocated block). Though somewhat less convenient, this is ok in my case. Moreover, since the "bucket configuration" is known at compile-time (implemented via C++ template voodoo) - the appropriate bucket is determined at compile time.
So far everything looks (and works) good.
Recently I worked on an algorithm that performs heap operations heavily, and naturally affected significantly by the heap performance. Profiling revealed that its performance is considerably impacted by the locking. That is, the heap itself works very fast (typical allocation involves just a few memory dereferencing instructions), but since the whole application is multi-threaded - the appropriate bucket is protected by the critical section, which relies on interlocked instructions, which are much heavier.
I've fixed this meanwhile by giving this algorithm its own dedicated heap, which is not protected by a critical section. But this imposes several problems/restrictions at the code level. Such as the need to pass the context information deep within the stack wherever the heap may be necessary. One may also use TLS to avoid this, but this may cause some problems with re-entrance in my specific case.
This makes me wonder: Is there a known technique to optimize the heap for (but not limit to) single-threaded usage?
EDIT:
Special thanks to #Voo for suggesting checking out the google's tcmalloc.
It seems to work similar to what I did more-or-less (at least for small objects). But in addition they solve the exact issue I have, by maintaining per-thread caching.
I too thought in this direction, but I thought about maintaining per-thread heaps. Then freeing a memory block allocated from the heap belonging to another thread is somewhat tricky: one should insert it in a sort of a locked queue, and that other thread should be notified, and free the pending allocations asynchronously. Asynchronous deallocation may cause problems: if that thread is busy for some reason (for instance performs an aggressive calculations) - no memory deallocation actually occurs. Plus in multi-threaded scenario the cost of deallocation is significantly higher.
OTOH the idea with caching seems much simpler, and more efficient. I'll try to work it out.
Thanks a lot.
P.S.:
Indeed google's tcmalloc is great. I believe it's implemented pretty much similar to what I did (at least fixed-size part).
But, to be pedantic, there's one matter where my heap is superior. According to docs, tcmalloc imposes an overhead roughly 1% (asymptotically), whereas my overhead is 0.0061%. It's 4/64K to be exact.
:)
One thought is to maintain a memory allocator per-thread. Pre-assign fairly chunky blocks of memory to each allocator from a global memory pool. Design your algorithm to assign the chunky blocks from adjacent memory addresses (more on that later).
When the allocator for a given thread is low on memory, it requests more memory from the global memory pool. This operation requires a lock, but should occur far less frequently than in your current case. When the allocator for a given thread frees it's last byte, return all memory for that allocator to the global memory pool (assume thread is terminated).
This approach will tend to exhaust memory earlier than your current approach (memory can be reserved for one thread that never needs it). The extent to which that is an issue depends on the thread creation / lifetime / destruction profile of your app(s). You can mitigate that at the expense of additional complexity, e.g. by introducing a signal that a memory allocator for given thread is out of memory, and the global pool is exhaused, that other memory allocators can respond to by freeing some memory.
An advantage of this scheme is that it will tend to eliminate false sharing, as memory for a given thread will tend to be allocated in contiguous address spaces.
On a side note, if you have not already read it, I suggest IBM's Inside Memory Management article for anyone implementing their own memory management.
UPDATE
If the goal is to have very fast memory allocation optimized for a multi-threaded environment (as opposed to learning how to do it yourself), have a look at alternate memory allocators. If the goal is learning, perhaps check out their source code.
Hoarde
tcmalloc (thanks Voo)
It might be a good idea to read Jeff Bonwicks classic papers on the slab allocator and vmem. The original slab allocator sounds somewhat what you're doing. Although not very multithread friendly it might give you some ideas.
The Slab Allocator: An Object-Caching Kernel Memory Allocator
Then he extended the concept with VMEM, which will definitely give you some ideas since it had very nice behavior in a multi cpu environment.
Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources

questions about memory pool

I need some clarifications for the concept & implementation on memory pool.
By memory pool on wiki, it says that
also called fixed-size-blocks allocation, ... ,
as those implementations suffer from fragmentation because of variable
block sizes, it can be impossible to use them in a real time system
due to performance.
How "variable block size causes fragmentation" happens? How fixed sized allocation can solve this? This wiki description sounds a bit misleading to me. I think fragmentation is not avoided by fixed sized allocation or caused by variable size. In memory pool context, fragmentation is avoided by specific designed memory allocators for specific application, or reduced by restrictly using an intended block of memory.
Also by several implementation samples, e.g., Code Sample 1 and Code Sample 2, it seems to me, to use memory pool, the developer has to know the data type very well, then cut, split, or organize the data into the linked memory chunks (if data is close to linked list) or hierarchical linked chunks (if data is more hierarchical organized, like files). Besides, it seems the developer has to predict in prior how much memory he needs.
Well, I could imagine this works well for an array of primitive data. What about C++ non-primitive data classes, in which the memory model is not that evident? Even for primitive data, should the developer consider the data type alignment?
Is there good memory pool library for C and C++?
Thanks for any comments!
Variable block size indeed causes fragmentation. Look at the picture that I am attaching:
The image (from here) shows a situation in which A, B, and C allocates chunks of memory, variable sized chunks.
At some point, B frees all its chunks of memory, and suddenly you have fragmentation. E.g., if C needed to allocate a large chunk of memory, that still would fit into available memory, it could not do because available memory is split in two blocks.
Now, if you think about the case where each chunk of memory would be of the same size, this situation would clearly not arise.
Memory pools, of course, have their own drawbacks, as you yourself point out. So you should not think that a memory pool is a magical wand. It has a cost and it makes sense to pay it under specific circumstances (i.e., embedded system with limited memory, real time constraints and so on).
As to which memory pool is good in C++, I would say that it depends. I have used one under VxWorks that was provided by the OS; in a sense, a good memory pool is effective when it is tightly integrated with the OS. Actually each RTOS offers an implementation of memory pools, I guess.
If you are looking for a generic memory pool implementation, look at this.
EDIT:
From you last comment, it seems to me that possibly you are thinking of memory pools as "the" solution to the problem of fragmentation. Unfortunately, this is not the case. If you want, fragmentation is the manifestation of entropy at the memory level, i.e., it is inevitable. On the other hand, memory pools are a way to manage memory in such a way as to effectively reduce the impact of fragmentation (as I said, and as wikipedia mentioned, mostly on specific systems like real time systems). This comes to a cost, since a memory pool can be less efficient than a "normal" memory allocation technique in that you have a minimum block size. In other words, the entropy reappears under disguise.
Furthermore, that are many parameters that affect the efficiency of a memory pool system, like block size, block allocation policy, or whether you have just one memory pool or you have several memory pools with different block sizes, different lifetimes or different policies.
Memory management is really a complex matter and memory pools are just a technique that, like any other, improves things in comparison to other techniques and exact a cost of its own.
In a scenario where you always allocate fixed-size blocks, you either have enough space for one more block, or you don't. If you have, the block fits in the available space, because all free or used spaces are of the same size. Fragmentation is not a problem.
In a scenario with variable-size blocks, you can end up with multiple separate free blocks with varying sizes. A request for a block of a size that is less than the total memory that is free may be impossible to be satisfied, because there isn't one contiguous block big enough for it. For example, imagine you end up with two separate free blocks of 2KB, and need to satisfy a request for 3KB. Neither of these blocks will be enough to provide for that, even though there is enough memory available.
Both fix-size and variable size memory pools will feature fragmentation, i.e. there will be some free memory chunks between used ones.
For variable size, this might cause problems, since there might not be a free chunk that is big enough for a certain requested size.
For fixed-size pools, on the other hand, this is not a problem, since only portions of the pre-defined size can be requested. If there is free space, it is guaranteed to be large enough for (a multiple of) one portion.
If you do a hard real time system, you might need to know in advance that you can allocate memory within the maximum time allowed. That can be "solved" with fixed size memory pools.
I once worked on a military system, where we had to calculate the maximum possible number of memory blocks of each size that the system could ever possibly use. Then those numbers were added to a grand total, and the system was configured with that amount of memory.
Crazily expensive, but worked for the defence.
When you have several fixed size pools, you can get a secondary fragmentation where your pool is out of blocks even though there is plenty of space in some other pool. How do you share that?
With a memory pool, operations might work like this:
Store a global variable that is a list of available objects (initially empty).
To get a new object, try to return one from the global list of available. If there isn't one, then call operator new to allocate a new object on the heap. Allocation is extremely fast which is important for some applications that might currently be spending a lot of CPU time on memory allocations.
To free an object, simply add it to the global list of available objects. You might place a cap on the number of items allowed in the global list; if the cap is reached then the object would be freed instead of returned to the list. The cap prevents the appearance of a massive memory leak.
Note that this is always done for a single data type of the same size; it doesn't work for larger ones and then you probably need to use the heap as usual.
It's very easy to implement; we use this strategy in our application. This causes a bunch of memory allocations at the beginning of the program, but no more memory freeing/allocating occurs which incurs significant overhead.

How to implement a memory heap

Wasn't exactly sure how to phrase the title, but the question is:
I've heard of programmers allocating a large section of contiguous memory at the start of a program and then dealing it out as necessary. This is, in contrast to simply going to the OS every time memory is needed.
I've heard that this would be faster because it would avoid the cost of asking the OS for contiguous blocks of memory constantly.
I believe the JVM does just this, maintaining its own section of memory and then allocating objects from that.
My question is, how would one actually implement this?
Most C and C++ compilers already provide a heap memory-manager as part of the standard library, so you don't need to do anything at all in order to avoid hitting the OS with every request.
If you want to improve performance, there are a number of improved allocators around that you can simply link with and go. e.g. Hoard, which wheaties mentioned in a now-deleted answer (which actually was quite good -- wheaties, why'd you delete it?).
If you want to write your own heap manager as a learning exercise, here are the basic things it needs to do:
Request a big block of memory from the OS
Keep a linked list of the free blocks
When an allocation request comes in:
search the list for a block that's big enough for the requested size plus some book-keeping variables stored alongside.
split off a big enough chunk of the block for the current request, put the rest back in the free list
if no block is big enough, go back to the OS and ask for another big chunk
When a deallocation request comes in
read the header to find out the size
add the newly freed block onto the free list
optionally, see if the memory immediately following is also listed on the free list, and combine both adjacent blocks into one bigger one (called coalescing the heap)
You allocate a chunk of memory at the beginning of the program large enough to sustain its need. Then you have to override new and/or malloc, delete and/or free to return memory from/to this buffer.
When implementing this kind of solution, you need to write your own allocator(to source from the chunk) and you may end up using more than one allocator which is often why you allocate a memory pool in the first place.
Default memory allocator is a good all around allocator but is not the best for all allocation needs. For example, if you know you'll be allocating a lot of object for a particular size, you may define an allocator that allocates fixed size buffer and pre-allocate more than one to gain some efficiency.
Here is the classic allocator, and one of the best for non-multithreaded use:
http://gee.cs.oswego.edu/dl/html/malloc.html
You can learn a lot from reading the explanation of its design. The link to malloc.c in the article is rotted; it can now be found at http://gee.cs.oswego.edu/pub/misc/malloc.c.
With that said, unless your program has really unusual allocation patterns, it's probably a very bad idea to write your own allocator or use a custom one. Especially if you're trying to replace the system malloc, you risk all kinds of bugs and compatibility issues from different libraries (or standard library functions) getting linked to the "wrong version of malloc".
If you find yourself needing specialized allocation for just a few specific tasks, that can be done without replacing malloc. I would recommend looking up GNU obstack and object pools for fixed-sized objects. These cover a majority of the cases where specialized allocation might have real practical usefulness.
Yes, both stdlib heap and OS heap / virtual memory are pretty troublesome.
OS calls are really slow, and stdlib is faster, but still has some "unnecessary"
locks and checks, and adds a significant overhead to allocated blocks
(ie some memory is used for management, in addition to what you allocate).
In many cases its possible to avoid dynamic allocation completely,
by using static structures instead. For example, sometimes its better (safer etc) to define a 64k
static buffer for unicode filename, than define a pointer/std:string and dynamically
allocate it.
When the program has to allocate a lot of instances of the same structure, its
much faster to allocate large memory blocks and then just store the instances there
(sequentially or using a linked list of free nodes) - C++ has a "placement new" for that.
In many cases, when working with varible-size objects, the set of possible sizes
is actually very limited (eg. something like 4+2*(1..256)), so its possible to use
a few pools like [3] without having to collect garbage, fill the gaps etc.
Its common for a custom allocator for specific task to be much faster than one(s)
from standard library, and even faster than speed-optimized, but too universal implementations.
Modern CPUs/OSes support "large pages", which can significantly improve the memory
access speed when you explicitly work with large blocks - see http://7-max.com/
IBM developerWorks has a nice article about memory management, with an extensive resources section for further reading: Inside memory management.
Wikipedia has some good information as well: C dynamic memory allocation, Memory management.