Defragmenting C++ Heap Allocator & STL

I'm looking to write a self-defragmenting memory manager, whereby a simple incrementing heap allocator is used in combination with a simple compacting defragmenter.
The rough scheme would be to allocate blocks starting at the lowest memory address going upwards and keeping book-keeping information starting at the highest memory address working downwards.
The memory manager would pass back smart pointers (Boost's intrusive_ptr seems the most obvious choice) to the book-keeping structs, which would themselves point to the actual memory blocks, thus giving a level of indirection so that the blocks can be easily moved around.
The defragmenter would compact the heap downwards, starting at 'generation' bookmarks to speed up the process, and would only defragment a fixed amount of memory at a time. Raw pointers to the blocks themselves would remain valid until the next defrag pass and so could be passed around freely until then, improving performance.
The specific application for this is console game programming and so at the beginning or end of each frame a defrag pass could be done relatively safely.
So my question is: has anybody used this kind of allocation scheme in combination with the STL? Would it just completely blow the STL apart, as I suspect? I can see std::list<intrusive_ptr> working at the intrusive_ptr level, but what about the allocation of the STL list nodes themselves? Is there any way to override the next/prev pointers to be intrusive_ptrs themselves, or am I just going to have to run a standard heap allocator alongside this more dynamic one?
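The double indirection described above might be sketched roughly like this. This is a minimal, single-threaded toy under my own assumptions, not the asker's actual design; all names (BumpHeap, BlockEntry, Handle) are invented for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <deque>
#include <vector>

// Each block gets a book-keeping entry; "smart pointers" resolve through
// the entry, so a compaction pass can move blocks and fix up a single
// address per block.
struct BlockEntry {
    void*  addr;   // current location; updated by the defragmenter
    size_t size;
    bool   live;
};

class Handle {
public:
    explicit Handle(BlockEntry* e) : entry_(e) {}
    void* get() const { return entry_->addr; }  // re-resolved on every use
    BlockEntry* entry() const { return entry_; }
private:
    BlockEntry* entry_;  // entries themselves never move
};

class BumpHeap {
public:
    explicit BumpHeap(size_t bytes) : storage_(bytes), top_(0) {}

    // Simple incrementing allocation, as in the question.
    Handle alloc(size_t n) {
        entries_.push_back({storage_.data() + top_, n, true});
        top_ += n;
        return Handle(&entries_.back());
    }

    void release(Handle h) { h.entry()->live = false; }

    // Naive compaction pass: slide every live block down and update its
    // entry; any outstanding raw pointers are invalidated here.
    void defrag() {
        size_t dst = 0;
        for (BlockEntry& e : entries_) {
            if (!e.live) continue;
            if (storage_.data() + dst != e.addr)
                std::memmove(storage_.data() + dst, e.addr, e.size);
            e.addr = storage_.data() + dst;
            dst += e.size;
        }
        top_ = dst;
    }

private:
    std::vector<char>      storage_;
    std::deque<BlockEntry> entries_;  // deque: growth must not move entries
    size_t                 top_;
};
```

Raw pointers obtained via get() are only safe between defrag() calls, which is exactly why the question restricts raw-pointer lifetime to a single frame.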

If you're going to be moving objects around in memory then you can't do this fully generically. You will only be able to do this with objects that know that they might be moved. You also will need a locking mechanism. When a function is being called on an object, then it can't be moved.
The reason is that the whole C++ model relies on objects sitting at fixed points in memory, so if a thread were calling a method on an object, and that thread was paused and the object moved, disaster would strike when the thread resumed.
Any object which held a raw memory pointer to another object that might be moved (including a sub-object of itself) would not work.
Such a memory management scheme may work but you have to be very careful. You need to be strict about implementing handles, and the handle->pointer locking semantics.
For STL containers, you can customize the allocator, but it still needs to return fixed raw memory pointers. You can't return an address that might move. For this reason, if you're using STL containers, they must be containers of handles, and the nodes themselves will be ordinary dynamically allocated memory. You may find that you lose more to the overhead of the handle indirection, and to fragmentation of the handle collections, than you gain by using the STL.
Using containers that understand your handles directly might be the only way forward, and even then there may still be a lot of overhead compared to a C++ application that uses traditional objects fixed in memory.

STL containers are implemented using naked pointers.
You can specify a custom allocator when you instantiate them (so that they initialize their pointers using your allocator), but (because the allocated values are stored in naked pointers) you don't know where those pointers are, and therefore you can't change them later.
Instead, you might consider implementing a subset of the STL yourself: your versions of the STL containers could then be implemented with managed pointers.

An alternative technique which is fairly well known is the buddy system. You should take a look at that for additional inspiration.

If this is for console game programming, it's a lot easier to forbid un-scoped dynamic memory allocation at runtime. Forbidding it at startup time as well is a bit more difficult to achieve.

My take on this: if you have to be afraid of fragmentation, that means you are juggling data pieces which are a huge fraction of your memory, and by this virtue alone you cannot have many of them. Do you already know what these will be? Maybe it would be better to step down a level and make more specific decisions, thus intruding less on the other code and the general performance of your application?
A list is an exceptionally bad example to put into a defragmenting memory manager, because it's a bunch of tiny pieces, as are most other STL data structures. If you do this, it will have all kinds of obvious bad implications, including the performance of your defragmenter going down, plus the indirection cost, etc. The only structures where it makes sense IMO are contiguous ones: array, deque, the main chunk of a hashtable, those things, and only beyond a certain size, and only after they are not going to be resized any longer. These kinds of things call, again, for specific solutions instead of generic ones.
Comment back on how it all turns out.

Related

Memory defragmentation/heap compaction - commonplace in managed languages, but not in C++. Why?

I've been reading up a little on zero-pause garbage collectors for managed languages. From what I understand, one of the most difficult things to do without stop-the-world pauses is heap compaction. Only very few collectors (eg Azul C4, ZGC) seem to be doing, or at least approaching, this.
So, most GCs introduce dreaded stop-the-world pauses to compact the heap (bad!). Not doing this seems extremely difficult, and does come with a performance/throughput penalty. So either way, this step seems rather problematic.
And yet - as far as I know, most if not all GCs still do compact the heap occasionally. I've yet to see a modern GC that doesn't do this by default. Which leads me to believe: It has to be really, really important. If it wasn't, surely, the tradeoff wouldn't be worth it.
At the same time, I have never seen anyone do memory defragmentation in C++. I'm sure some people somewhere do, but - correct me if I am wrong - it does not at all seem to be a common concern.
I could of course imagine static memory somewhat lessens this, but surely, most codebases would do a fair amount of dynamic allocations?!
So I'm curious, why is that?
Are my assumptions (very important in managed languages; rarely done in C++) even correct? If yes, is there any explanation I'm missing?
Garbage collection can compact the heap because it knows where all of the pointers are. After all, it just finished tracing them. That means that it can move objects around and adjust the pointers (references) to the new location.
However, C++ cannot do that, because it doesn't know where all the pointers are. If the memory allocation library moved things around, there could be dangling pointers to the old locations.
Oh, and for long running processes, C++ can indeed suffer from memory fragmentation. This was more of a problem on 32-bit systems because it could fail to allocate memory from the OS, because it might have used up all of the available 1 MB memory blocks. In 64-bit it is almost impossible to create so many memory mappings that there is nowhere to put a new one. However, if you ended up with a 16 byte memory allocation in each 4K memory page, that's a lot of wasted space.
C and C++ applications solve that by using storage pools. For a web server, for example, it would start a pool with a new request. At the end of that web request, everything in the pool gets destroyed. The pool makes a nice, constant sized block of RAM that gets reused over and over without fragmentation.
Garbage collection tends to use recycling pools as well, because it avoids the strain of running a big GC trace and reclaim at the end of a connection.
One method some old operating systems like Apple's OS 9 used, before virtual memory was a thing, is handles. Instead of a memory pointer, allocation returned a handle. That handle was a pointer to the real object in memory. When the operating system needed to compact memory or swap it to disk, it would update the pointer behind the handle.
I have actually implemented a similar system in C++ using an array of handles into a shared memory map pseudo-database. When the map was compacted, the handle table was scanned for affected entries and updated.
Generic memory compaction is not generally useful nor desirable because of its costs.
What may be desirable is to have no wasted/fragmented memory and that can be achieved by other methods than memory compaction.
In C++ one can come up with a different allocation approach for objects that do cause fragmentation in their specific application, e.g. double-pointers or double-indexes to allow for object relocation, or object pools or arenas that prevent or minimize fragmentation. Such solutions for specific object types are superior to generic garbage collection because they employ application/business-specific knowledge, which allows them to minimize the scope/cost of object storage maintenance and to perform it at the most appropriate times.
One study found that garbage-collected languages require five times more memory to achieve the performance of non-GC equivalent programs. Memory fragmentation is more severe in GC languages.

How do STL linked structures handle allocation?

The C++ Standard Template Library provides a number of container types which have very obvious implementations as linked structures, such as list and map.
A very basic optimization with highly-linked structures is to use a custom sub-allocator with a private memory pool providing fixed-size allocation. Given the STL's emphasis on performance, I would expect this or similar optimizations to be performed. At the same time, all these containers have an optional Allocator template parameter, and it would seem largely redundant to be able to provide a custom allocator to something that already uses a custom allocator.
So, if I'm looking for maximum-performance linked structures with the STL, do I need to specify a custom allocator, or can I count on the STL to do that work for me?
It depends a lot on your workload.
If you don't iterate through your data structures a lot, don't even bother optimizing anything. Your time is better spent elsewhere.
If you do iterate, but your payload is large and you do a lot of work per item, it's unlikely that the default implementation is going to be the bottleneck. The iteration inefficiencies are going to be swallowed by the per item work.
If you store small elements (ints, pointers), you do trivial operations and you iterate through the structure a lot, then you'll get better performance out of something like std::vector or boost::flat_map since they allow better pre-fetch operations.
Allocators are mostly useful when you find yourself allocating and deallocating a lot of small bits of memory. This causes memory fragmentation and can have performance implications.
As with all performance advice you need to benchmark your workload on your target machine.
P.S. Make sure the optimizations are turned on (i.e. -O3).
Naturally it can vary from one standard lib implementation to the next, but last time I checked in libs like MSVC, GNU C++, and EASTL, linked structures allocate node and element data in a single allocation.
However, each node is still allocated one-at-a-time against std::allocator which is a fairly general-purpose variable-length allocator (though it can at least assume that all elements being allocated are of a particular data type, but many times I've found it just defaulting to malloc calls in VTune and CodeXL sessions). Sometimes there's even thread-safe memory allocation going on which is a bit wasteful when the data structure itself isn't designed for concurrent modifications or simultaneous reads/writes while using thread-safe, general-purpose memory allocation to allocate one node at a time.
The design makes sense though if you want to allow the client to pass in their own custom allocators as template parameters. In that case you don't want the data structure to pool memory, since that would be fighting against what the allocator might want to do. There's a decision to be made with linked structures in particular: whether you allocate one node at a time and pass the responsibility for more efficient allocation techniques like free lists to the allocator, to make each individual node allocation efficient, or avoid depending on the allocator and effectively do the pooling in the data structure by allocating many nodes at once contiguously and pooling them. The standard lib leans towards the former route, which can unfortunately make things like std::list and std::map, when used with the default std::allocator, very inefficient.
Personally for linked structures, I use my own handrolled solutions which rely on 32-bit indexes into arrays (ex: std::vector) which effectively serve as the pooled memory and work like an "indexed free list", like so:
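(The original snippet is not reproduced in this copy; the following is a minimal POD-only sketch of such an indexed free list, with all names invented, not the answerer's actual code.)

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// "Indexed free list": nodes live contiguously in a std::vector and
// 32-bit indices replace pointers. Freed slots are chained together
// and reused in constant time.
template <typename T>  // T must be a POD in this simplified version
class IndexedFreeList {
    union Slot {
        T        element;
        uint32_t next_free;  // links vacant slots together
    };
    std::vector<Slot> slots_;
    uint32_t free_head_ = 0xFFFFFFFFu;  // sentinel: no free slot

public:
    // Returns the index of the inserted element.
    uint32_t insert(const T& value) {
        uint32_t idx;
        if (free_head_ != 0xFFFFFFFFu) {
            idx = free_head_;                  // reclaim a vacant slot
            free_head_ = slots_[idx].next_free;
        } else {
            idx = static_cast<uint32_t>(slots_.size());
            slots_.push_back(Slot{});
        }
        slots_[idx].element = value;
        return idx;
    }

    // Constant-time removal: the slot joins the free chain.
    void erase(uint32_t idx) {
        slots_[idx].next_free = free_head_;
        free_head_ = idx;
    }

    T&       operator[](uint32_t idx)       { return slots_[idx].element; }
    const T& operator[](uint32_t idx) const { return slots_[idx].element; }
};
```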
... where we might actually store the linked list nodes inside std::vector. The links just become a way to allow us to remove things in constant time and reclaim the empty spaces in constant time. The real example is a little more complex than the pseudocode above, since that code only works for PODs (the real one uses aligned_storage, placement new, and manual dtor invocation like the standard containers), but it's not that much more complex. Similar case here:
... a doubly-linked "index" list using std::vector (or something like std::deque if you don't want pointer invalidation), for example, to store the list nodes. In that case the links allow us to just skip around when traversing the vector's contiguous memory. The whole point of that is to allow constant-time removal and insertion anywhere to the list while preserving insertion order on traversal (something which would be lost with just std::vector if we used swap-to-back-and-pop-back technique for constant-time removal from the middle).
Aside from making everything more contiguous and cache-friendly to traverse as well as faster to allocate and free, it also halves the sizes of the links on 64-bit architectures when we can use 32-bit indices instead into a random-access sequence storing the nodes.
Linked lists have actually accumulated a really bad rep in C++ and I believe it's largely for this reason. The people benchmarking are using std::list against the default allocator and incurring bottlenecks in the form of cache misses galore on traversal and costly, possibly thread-safe memory allocations with the insertion of each individual node and freeing with the removal of each individual node. Similar case with the huge preference these days towards unordered_map and unordered_set over map and set. Hash tables might have always had some edge but that edge is so skewed when map and set just use a general-purpose allocator one node at a time and incur cache misses galore on tree traversal.
So, if I'm looking for maximum-performance linked structures with the
STL, do I need to specify a custom allocator, or can I count on the
STL to do that work for me?
The advice to measure/profile is always wise, but if your needs are genuinely critical (like you're looping over the data repeatedly every single frame and it stores hundreds of thousands of elements or more while repeatedly also inserting and removing elements to/from the middle each frame), then I'd at least reach for a free list before using the likes of std::list or std::map. And linked lists are such trivial data structures that I'd actually recommend rolling your own if you're really hitting hotspots with the linked structures in the standard library instead of having to deal with the combo of both allocator and data structure to achieve an efficient solution (it can be easier to just have a data structure which is very efficient for your exact needs in its default form if it's trivial enough to implement).
I used to fiddle around with allocators a lot, reaching around data structures and trying to make them more efficient by experimenting with allocators (with moderate success, enough to encourage me but not amazing results), but I've found my life becoming so much easier to just make linked structures which pool their memory upfront (which gave me the most amazing results). And funnily enough, just creating these data structures which are more efficient about their allocation strategies upfront took less time than all the time I spent fiddling with allocators (trying out third party ones as well as implementing my own). Here's a quick example I whipped up which uses linked lists for the collision detection with 4 million particles (this is old so it was running on an i3).
It uses singly-linked lists using my own deque-like container to store the nodes like so:
Similar thing here with a spatial index for collision between 500k variable-sized agents (just took 2 hours to implement the whole thing and I didn't even bother to multithread it):
I point this out mainly for those who say linked lists are so inefficient since, as long as you store the nodes in an efficient and relatively contiguous way, they can really be a useful tool to add to your arsenal. I think the C++ community at large dismissed them a bit too hastily as I'd be completely lost without linked lists. Used correctly, they can reduce heap allocations rather than multiply them and improve spatial locality rather than degrade it (ex: consider that grid diagram above if it used a separate instance of std::vector or SmallVector with a fixed SBO for every single cell instead of just storing one 32-bit integer). And it doesn't take long to write, say, a linked list which allocates nodes very efficiently -- I'd be surprised if anyone takes more than a half hour to write both the data structure and unit test. Similar case with, say, an efficient red-black tree which might take a couple of hours but it's not that big of a deal.
These days I just end up storing the linked nodes directly inside things like std::vector, my own chunkier equivalent of std::deque, tbb::concurrent_vector if I need to build a concurrent linked structure, etc. Life becomes a lot easier when efficient allocation is absorbed into the data structure's responsibility rather than having to think about efficient allocation and the data structure as two entirely separate concepts and having to whip up and pass all these different types of allocators around all over the place. The design I favor these days is something like:
// Creates a tree storing elements of type T with 'BlockSize' contiguous
// nodes allocated at a time and pooled.
Tree<T, BlockSize> tree;
... or I just omit that BlockSize parameter and let the nodes be stored in std::vector with amortized reallocations while storing all nodes contiguously. I don't even bother with an allocator template parameter anymore. Once you absorb efficient node allocation responsibilities into the tree structure, there's no longer much benefit to a class template for your allocator since it just becomes like a malloc and free interface and dynamic dispatch becomes trivially cheap at that point when you're only involving it once for, say, every 128 nodes allocated/freed at once contiguously if you still need a custom allocator for some reason.
So, if I'm looking for maximum-performance linked structures with the
STL, do I need to specify a custom allocator, or can I count on the
STL to do that work for me?
So coming back to this question, if you genuinely have a very performance-critical need (either anticipated upfront like large amounts of data you have to process every frame or in hindsight through measurements), you might even consider just rolling some data structures of your own which store nodes in things like std::vector. As counter-productive as that sounds, it can take a lot less time than fiddling around and experimenting with memory allocators all day long, not to mention an "indexed linked list" which allocates nodes into std::vector using 32-bit indices for the links will halve the cost of the links and also probably take less time to implement than an std::allocator-conforming free list, e.g. And hopefully if people do this more often, linked lists can start to become a bit more popular again, since I think they've become too easily dismissed as inefficient when, used in a way that allocates nodes efficiently, they might actually be an excellent data structure for certain problems.
While the standard does not explicitly forbid such optimizations, it would be a poor design choice by the implementer.
First of all, one could imagine a use case where pooling allocation would not be a desirable choice.
It's not too hard to pass a custom allocator in the template parameters to introduce the pooling behavior you want, but disabling that behavior if it were built into the container would be pretty much impossible.
Also, from the OOP point of view, you'd have a template that obviously has more than one responsibility, which some consider a bad sign.
The overall answer seems to be "Yes, you do need a custom allocator" (Boost::pool_alloc?).
Finally, you can write a simple test to check what your specific implementation does.

Can STL help addressing memory fragmentation

This is regarding a new TCP server being developed (in C++ on Windows/VC2010)
Thousands of clients connect and keep sending enormous asynchronous requests. I am storing incoming requests in raw linked list ('C' style linked-list of structures, where each structure is a request) and process them one by one in synchronized threads.
I am using new and delete to create/destroy those request structures.
Until now I was under the impression that this is the most efficient approach. But recently I found that, even after all clients were disconnected, the Private Bytes of the server process still showed a lot of memory consumption (around 45 MB). It never came back to its original level.
I dug around a lot and made sure there are no memory leaks. Finally, I came across this and realized it's because of memory fragmentation caused by lots of new and delete calls.
Now my couple of questions are:
If I replace my raw linked list with STL data structures to store incoming requests, will it help me get rid of memory fragmentation? (Because, to my knowledge, the STL uses contiguous blocks, kind of its own memory management, resulting in much less fragmentation. But I am not sure if this is true.)
What would be performance impact in that case as compared to raw linked list?
I suspect your main problem is that you are using linked lists. Linked lists are horrible for this sort of thing and cause exactly the problem you are seeing. Many years ago, I wrote TCP code that did very similar things, in plain old C. The way to deal with this is to use dynamic arrays. You end up with far fewer allocations.
In those bad old days, I rolled my own, which is actually quite simple. Just allocate a single data structure for some number of records, say ten. When you are about to overflow, double the size, reallocating and copying. Because you increase the size exponentially, you will never have more than a handful of allocations, making fragmentation a non-issue. In addition, you have none of the overhead that comes with list.
Really, lists should almost never be used.
Now in terms of your actual question, yes, the STL should help you, but DON'T use std::list. Use std::vector in the manner I just outlined. In my experience, in 95% of the cases, std::list is an inferior choice.
If you use std::vector, you may want to use vector::reserve to preallocate the number of records you expect you may see. It'll save you a few allocations.
Have you seen that your memory usage and fragmentation is causing you performance problems? I would think it is more from doing new / delete a lot. STL probably won't help unless you use your own allocator and pre-allocate a large chunk and manage it yourself. In other words, it will require a lot of work.
It's often OK to use up memory if you have it. You may want to consider pooling your request structures so you don't need to reallocate them. Then you can allocate on demand and add them to your pool.
Maybe. std::list allocates each node dynamically like a homebrew linked list. "STL uses contiguous block.." - this is not true. You could try std::vector which is like an array and therefore will cause less memory fragmentation. Depends on what you need the data structure for.
I wouldn't expect any discernable difference in performance between a (well-implemented) homebrew linked list and std::list. If you need a stack, std::vector is much more efficient and if you need a queue (eg fifo) then std::deque is much more efficient than linked lists.
If you are serious about preventing memory fragmentation, you will need to manage your own memory and custom allocators or use some third party library. It's not a trivial task.
Instead of raw pointers you can use std::unique_ptr. It has minimal overhead, and makes sure your pointers get deleted.
In my opinion there are pretty few cases where a linked list is the right choice of data structure. You need to choose your data structure based on the way you use your data. For example, using a vector will keep your data together, which is good for the cache, and if you can manage to add/remove elements at its end, you avoid fragmentation.
If you want to avoid the overhead of new/delete you can pool your objects. Even this way, you still need to handle fragmentation.

Why shouldn't I use shared_ptr and unique_ptr always and instead use normal pointers?

I have a background in C# and obj-c so RC/GC are things I (still) hold dear to me. As I started learning C++ in more depth, I can't stop wondering why I would use normal pointers when they are so unmanaged instead of other alternative solutions?
the shared_ptr provides a great way to store references and not lose track of them without deleting them. I can see practical approaches for normal pointers but they seem just bad practices.
Can someone make of case of these alternatives?
Of course you're encouraged to use shared and unique ptrs IF they're owning pointers. If you only need an observer, however, a raw pointer will do just fine (a pointer bearing no responsibility for whatever it points to).
There is basically no overhead for std::unique_ptr, and there is some in std::shared_ptr as it does reference counting for you, but rarely will you ever need to save on execution time here.
Also, there is no need for "smart" pointers if you can guarantee lifetime / ownership hierarchy by design; say a parent node in a tree outliving its children - although this is slightly related to the fact whether the pointer actually owns something.
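A common pattern combining these rules, owning unique_ptrs plus non-owning raw back-pointers whose validity is guaranteed by design (the parent outlives its children), might look like this sketch with invented names:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string name;
    Node* parent = nullptr;                       // non-owning observer
    std::vector<std::unique_ptr<Node>> children;  // owning pointers

    Node* add_child(const std::string& n) {
        children.push_back(std::make_unique<Node>());
        children.back()->name = n;
        children.back()->parent = this;
        return children.back().get();  // raw pointer, no ownership attached
    }
};
```

The child never deletes its parent, and the parent's destructor tears down the whole subtree, so no smart pointer is needed for the back link.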
As someone else mentioned, in C++ you have to consider ownership. That being said, the 3D networked multiplayer FPS I'm currently working on has an official rule called "No new or delete." It uses only shared and unique pointers for designating ownership, and raw pointers retrieved from them (using .get()) everywhere that we need to interact with C API's. The performance hit is not noticeable. I use this as an example to illustrate the negligible performance hit since games/simulations typically have the strictest performance requirements.
This has also significantly reduced the amount of time spent debugging and hunting down memory leaks. In theory, a well-designed application would never run into these problems. In real life working with deadlines, legacy systems, or existing game engines that were poorly designed, however, they are an inevitability on large projects like games... unless you use smart pointers. If you must dynamically allocate, don't have ample time for designing/rewriting the architecture or debugging problems related to resource management, and you want to get it off the ground as quickly as possible, smart pointers are the way to go and incur no noticeable performance cost even in large-scale games.
The question is more the opposite: why should you use smart pointers? Unlike C# (and I think Obj-C), C++ makes extensive use of value semantics, which means that unless an object has an application-specific lifetime (in which case, none of the smart pointers apply), you will normally be using value semantics, and no dynamic allocation. There are exceptions, but if you make it a point of thinking in terms of value semantics (e.g. like int), of defining appropriate copy constructors and assignment operators where necessary, and of not allocating dynamically unless the object has an externally defined lifetime, you'll find that you rarely need to do anything more; everything just takes care of itself. Using smart pointers is very much the exception in most well written C++.
Sometimes you need to interoperate with C APIs, in which case you'll need to use raw pointers for at least those parts of the code.
In embedded systems, pointers are often used to access registers of hardware chips or memory at specific addresses.
Since the hardware registers already exist, there is no need to dynamically allocate them. There won't be any memory leaks if the pointer is deleted or its value changed.
Similarly with function pointers. Functions are not dynamically allocated and they have "fixed" addresses. Reassigning a function pointer or deleting one will not cause a memory leak.

How to implement a memory heap

Wasn't exactly sure how to phrase the title, but the question is:
I've heard of programmers allocating a large section of contiguous memory at the start of a program and then dealing it out as necessary. This is, in contrast to simply going to the OS every time memory is needed.
I've heard that this would be faster because it would avoid the cost of asking the OS for contiguous blocks of memory constantly.
I believe the JVM does just this, maintaining its own section of memory and then allocating objects from that.
My question is, how would one actually implement this?
Most C and C++ compilers already provide a heap memory-manager as part of the standard library, so you don't need to do anything at all in order to avoid hitting the OS with every request.
If you want to improve performance, there are a number of improved allocators around that you can simply link with and go. e.g. Hoard, which wheaties mentioned in a now-deleted answer (which actually was quite good -- wheaties, why'd you delete it?).
If you want to write your own heap manager as a learning exercise, here are the basic things it needs to do:
Request a big block of memory from the OS
Keep a linked list of the free blocks
When an allocation request comes in:
search the list for a block that's big enough for the requested size plus some book-keeping variables stored alongside.
split off a big enough chunk of the block for the current request, put the rest back in the free list
if no block is big enough, go back to the OS and ask for another big chunk
When a deallocation request comes in
read the header to find out the size
add the newly freed block onto the free list
optionally, see if the memory immediately following is also listed on the free list, and combine both adjacent blocks into one bigger one (called coalescing the heap)
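The steps above can be sketched as a toy first-fit heap. This is a learning-exercise sketch under my own assumptions (names invented; alignment is handled crudely and coalescing of adjacent free blocks is left out for brevity):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct FreeBlock {
    size_t     size;  // usable bytes following this header
    FreeBlock* next;  // next block on the free list
};

class SimpleHeap {
public:
    explicit SimpleHeap(size_t bytes) {
        arena_ = static_cast<char*>(std::malloc(bytes));  // "ask the OS" once
        free_list_ = reinterpret_cast<FreeBlock*>(arena_);
        free_list_->size = bytes - sizeof(FreeBlock);
        free_list_->next = nullptr;
    }
    ~SimpleHeap() { std::free(arena_); }

    void* alloc(size_t n) {
        n = (n + 15) & ~size_t(15);  // keep blocks 16-byte aligned
        FreeBlock** link = &free_list_;
        for (FreeBlock* b = free_list_; b; link = &b->next, b = b->next) {
            if (b->size < n) continue;                    // first fit
            if (b->size >= n + sizeof(FreeBlock) + 16) {
                // Split: carve the tail off into a new free block.
                FreeBlock* rest = reinterpret_cast<FreeBlock*>(
                    reinterpret_cast<char*>(b + 1) + n);
                rest->size = b->size - n - sizeof(FreeBlock);
                rest->next = b->next;
                b->size = n;
                *link = rest;
            } else {
                *link = b->next;                          // use whole block
            }
            return b + 1;  // user memory starts right after the header
        }
        return nullptr;    // a real heap would ask the OS for another chunk
    }

    void free(void* p) {
        FreeBlock* b = static_cast<FreeBlock*>(p) - 1;  // read the header
        b->next = free_list_;                           // push onto free list
        free_list_ = b;  // a real heap would coalesce with neighbors here
    }

private:
    FreeBlock* free_list_;
    char*      arena_;
};
```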
You allocate a chunk of memory at the beginning of the program, large enough to sustain its needs. Then you have to override new and/or malloc, and delete and/or free, to return memory from/to this buffer.
When implementing this kind of solution, you need to write your own allocator (to source from the chunk), and you may end up using more than one allocator, which is often why you allocate a memory pool in the first place.
The default memory allocator is a good all-around allocator but is not the best for all allocation needs. For example, if you know you'll be allocating a lot of objects of a particular size, you may define an allocator that allocates fixed-size buffers and pre-allocates more than one to gain some efficiency.
Here is the classic allocator, and one of the best for non-multithreaded use:
http://gee.cs.oswego.edu/dl/html/malloc.html
You can learn a lot from reading the explanation of its design. The link to malloc.c in the article is rotted; it can now be found at http://gee.cs.oswego.edu/pub/misc/malloc.c.
With that said, unless your program has really unusual allocation patterns, it's probably a very bad idea to write your own allocator or use a custom one. Especially if you're trying to replace the system malloc, you risk all kinds of bugs and compatibility issues from different libraries (or standard library functions) getting linked to the "wrong version of malloc".
If you find yourself needing specialized allocation for just a few specific tasks, that can be done without replacing malloc. I would recommend looking up GNU obstack and object pools for fixed-sized objects. These cover a majority of the cases where specialized allocation might have real practical usefulness.
Yes, both the stdlib heap and the OS heap / virtual memory are pretty troublesome. OS calls are really slow, and the stdlib is faster but still has some "unnecessary" locks and checks, and adds a significant overhead to allocated blocks (i.e. some memory is used for management, in addition to what you allocate).
In many cases it's possible to avoid dynamic allocation completely by using static structures instead. For example, sometimes it's better (safer etc.) to define a 64k static buffer for a Unicode filename than to define a pointer/std::string and dynamically allocate it.
When the program has to allocate a lot of instances of the same structure, it's much faster to allocate large memory blocks and then just store the instances there (sequentially, or using a linked list of free nodes); C++ has "placement new" for that.
In many cases, when working with variable-size objects, the set of possible sizes is actually very limited (e.g. something like 4+2*(1..256)), so it's possible to use a few pools like [3] without having to collect garbage, fill the gaps, etc.
It's common for a custom allocator for a specific task to be much faster than one(s) from the standard library, and even faster than speed-optimized but too-universal implementations.
Modern CPUs/OSes support "large pages", which can significantly improve memory access speed when you explicitly work with large blocks; see http://7-max.com/
IBM developerWorks has a nice article about memory management, with an extensive resources section for further reading: Inside memory management.
Wikipedia has some good information as well: C dynamic memory allocation, Memory management.